Feature #17301

closed

Special case report exit_code 137 as likely out of memory error

Added by Peter Amstutz almost 4 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: CWL
Target version:
Story points: -
Release relationship: Auto

Description

One of the most common reasons for containers to fail is running out of memory and being OOM killed. When this happens, the container exit code is 137. arvados-cwl-runner should detect that and print a warning. Workbench2 needs to display container warnings and errors, similar to how Workbench 1 already does.
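
A minimal sketch of the check described above, not the code from the 17301-cwl-oom branch. Exit code 137 is 128 + 9, i.e. the process was terminated by signal 9 (SIGKILL), which is what the kernel OOM killer sends. The function name `oom_warning` and the constant `OOM_EXIT_CODE` are hypothetical; the message text is the shortened wording quoted later in this ticket.

```python
# Exit codes above 128 conventionally mean "killed by signal (code - 128)".
# 137 = 128 + 9, and signal 9 is SIGKILL, the signal used by the OOM killer.
OOM_EXIT_CODE = 128 + 9


def oom_warning(exit_code):
    """Return a warning string if exit_code suggests an OOM kill, else None.

    Hypothetical helper for illustration only. SIGKILL can come from other
    sources too, so the wording says "may have been killed".
    """
    if exit_code == OOM_EXIT_CODE:
        return ("Container may have been killed for using too much RAM. "
                "Try resubmitting with a higher 'ramMin'.")
    return None
```

A caller would pass the finished container's exit code and log the returned string as a warning when it is not None.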


Subtasks 1 (0 open, 1 closed)

Task #18956: Review 17301-cwl-oom | Resolved | Peter Amstutz | 04/20/2022

Related issues

Related to Arvados Epics - Idea #16945: WB2 Workflows / containers feature parity | Resolved | 08/01/2021 | 03/31/2023
Related to Arvados - Feature #18513: Print "exited from signal XY" for exit codes >128 | Resolved | Tom Clegg | 01/18/2022
Actions #1

Updated by Peter Amstutz almost 4 years ago

  • Category set to Workbench2
  • Description updated (diff)
Actions #2

Updated by Peter Amstutz almost 4 years ago

  • Target version set to 2021-02-17 sprint
Actions #3

Updated by Peter Amstutz almost 4 years ago

  • Related to Idea #16945: WB2 Workflows / containers feature parity added
Actions #4

Updated by Peter Amstutz almost 4 years ago

  • Release deleted (31)
  • Target version deleted (2021-02-17 sprint)
Actions #6

Updated by Peter Amstutz over 2 years ago

  • Target version set to 2022-03-30 Sprint
Actions #7

Updated by Peter Amstutz over 2 years ago

  • Target version changed from 2022-03-30 Sprint to 2022-04-13 Sprint
Actions #8

Updated by Peter Amstutz over 2 years ago

  • Related to Feature #18513: Print "exited from signal XY" for exit codes >128 added
Actions #9

Updated by Peter Amstutz over 2 years ago

  • Assigned To set to Peter Amstutz
Actions #10

Updated by Peter Amstutz over 2 years ago

  • Category changed from Workbench2 to CWL
Actions #11

Updated by Peter Amstutz over 2 years ago

  • Target version changed from 2022-04-13 Sprint to 2022-04-27 Sprint
Actions #12

Updated by Peter Amstutz over 2 years ago

  • Status changed from New to In Progress
Actions #13

Updated by Peter Amstutz over 2 years ago

17301-cwl-oom @ c22d90571a1fcb4b52e5387a791e3aefff5be6af

  • Add special message about exit code 137
  • Rework how runtime_status is updated, now takes the first line of the first message for the main message, and adds all subsequent messages in "details"
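
One way to read the runtime_status rework in the second bullet, as a rough sketch rather than the branch's actual code. The function name `summarize_messages` is hypothetical, and the `warning` / `warningDetail` keys are an assumption made by analogy with the `activityDetail` key mentioned in the review below. It is also an assumption that the remaining lines of the first message go into the details.

```python
def summarize_messages(messages, kind="warning"):
    """Collapse a list of log messages into a runtime_status-style dict.

    The first line of the first message becomes the main message; the rest
    of the first message and all subsequent messages go into the detail
    field. Key names are assumed, not taken from the API documentation.
    """
    if not messages:
        return {}
    first_lines = messages[0].splitlines()
    summary = first_lines[0] if first_lines else ""
    detail_lines = first_lines[1:]
    for message in messages[1:]:
        detail_lines.extend(message.splitlines())
    return {kind: summary, kind + "Detail": "\n".join(detail_lines)}
```

For example, `summarize_messages(["step failed\ntraceback...", "retrying"])` puts "step failed" in the main message and the other two lines in the detail field.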

developer-run-tests: #3069

workbench re-run:

developer-run-tests-apps-workbench-integration: #3280

Actions #14

Updated by Lucas Di Pentima over 2 years ago

Reviewing c22d905

  • The code assumes that runtime_status['activityDetail'] is legal. Do we know if it's at least accepted in railsAPI/controller? (The documentation doesn't mention it)
  • The warning message seems to me a little too wordy. I was thinking that we could have an indexed documentation page where to point the user for broader explanations of the summarized messages that we display in WB2's UI. Food for thought, not sure if it should apply to this story.
  • At executor.py :
    • Line 264: That comment seems to be outdated now.
    • Line 268: There's a trailing semicolon.
  • If we're going to use runtime_status as some sort of logging store (as I understand, any error/warning will be appended to this field) we'll need to think how to handle long texts on WB2.
Actions #15

Updated by Peter Amstutz over 2 years ago

Lucas Di Pentima wrote:

Reviewing c22d905

  • The code assumes that runtime_status['activityDetail'] is legal. Do we know if it's at least accepted in railsAPI/controller? (The documentation doesn't mention it)

Since a-c-r never posts 'activity' status I just took it out.

  • The warning message seems to me a little too wordy. I was thinking that we could have an indexed documentation page where to point the user for broader explanations of the summarized messages that we display in WB2's UI. Food for thought, not sure if it should apply to this story.

I cut the text back to "Container may have been killed for using too much RAM. Try resubmitting with a higher 'ramMin'."

  • At executor.py :
    • Line 264: That comment seems to be outdated now.
    • Line 268: There's a trailing semicolon.

Fixed

  • If we're going to use runtime_status as some sort of logging store (as I understand, any error/warning will be appended to this field) we'll need to think how to handle long texts on WB2.

I added a 40 line limit to details.
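
A sketch of what a 40 line cap on the details text could look like. The names `MAX_DETAIL_LINES` and `limit_detail` are hypothetical, and keeping the first 40 lines (rather than the last 40) is an assumption; the ticket does not say which end is kept.

```python
MAX_DETAIL_LINES = 40  # limit mentioned in this comment


def limit_detail(detail, max_lines=MAX_DETAIL_LINES):
    """Truncate a multi-line detail string to at most max_lines lines.

    Assumption: the earliest lines are kept and later ones are dropped.
    """
    lines = detail.splitlines()
    if len(lines) <= max_lines:
        return detail
    return "\n".join(lines[:max_lines])
```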

17301-cwl-oom @ 332b0d1b4a9095f4e43893ec741f901b74b36ceb

developer-run-tests: #3071

Actions #16

Updated by Lucas Di Pentima over 2 years ago

Updates LGTM, but I don't understand why these tests failed: developer-run-tests-remainder: #3208 /console

Actions #17

Updated by Peter Amstutz over 2 years ago

This was annoying because it wasn't failing for me locally.

I fixed up the test cases to make sure RuntimeStatusLoggingHandler gets removed from the global logger.
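
The general pattern behind this fix can be sketched as follows. This is an illustration, not the actual test change: `CollectingHandler` is a hypothetical stand-in for RuntimeStatusLoggingHandler, `run_with_handler` is a hypothetical helper, and attaching to the root logger is an assumption about what "global logger" means here. A handler left attached to a shared logger outlives the test that added it, so later tests can behave differently depending on run order, which would be consistent with failures that do not reproduce locally.

```python
import logging


class CollectingHandler(logging.Handler):
    """Hypothetical stand-in handler that records formatted messages."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def run_with_handler(fn, logger=None):
    """Run fn with a temporary handler attached, always detaching it after."""
    logger = logger or logging.getLogger()
    handler = CollectingHandler()
    logger.addHandler(handler)
    try:
        fn()
    finally:
        # Detach even if fn raises, so no state leaks into other tests.
        logger.removeHandler(handler)
    return handler.messages
```

In a unittest-based suite the same cleanup can be registered with `self.addCleanup(logger.removeHandler, handler)` right after the handler is added.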

developer-run-tests: #3072

Actions #18

Updated by Peter Amstutz over 2 years ago

  • Status changed from In Progress to Resolved
Actions #19

Updated by Peter Amstutz over 2 years ago

  • Release set to 51