Bug #18262

open

[crunch-run] handle out-of-diskspace on the compute node better

Added by Ward Vandewege about 3 years ago. Updated 10 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version:
Story points: -
Release:
Release relationship: Auto

Description

When a job consumes all available disk space on a compute node that was not started with a particular scratch space requirement (i.e. no extra partition was added), the job fills up the node's root partition and bad things happen.

In one example today, a workflow filled up the root partition (which was tiny). That caused /etc/resolv.conf to be emptied on the next DHCP renew (sigh), so crunch-run could no longer find the API server or the keepstores. The container failed with truncated logs and without being explicitly marked as failed. It looked as if crunch-run was crashing until we caught the compute node in the act, which was a bit of a debugging adventure.

Can we somehow restrict the amount of disk space the container is allowed to use?
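
To make the question concrete, here is a minimal sketch of one possible approach: periodically measure how much the container has written to its scratch directory and how much free space remains on the filesystem, and stop the container with an explicit reason before the partition is full. This is not how crunch-run works today. The function names, the `quota_bytes` and `min_free_bytes` parameters, and the idea of polling are all assumptions made for illustration.

```python
import os
import shutil


def dir_usage_bytes(root):
    """Sum the apparent sizes of the files under root.

    Symlinks are counted as links and are not followed. Apparent size
    (st_size) can differ from the blocks actually allocated on disk.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # the file vanished while we were scanning
            total += st.st_size
    return total


def disk_problem(scratch_dir, quota_bytes, min_free_bytes):
    """Return a human-readable reason to stop the container, or None.

    Hypothetical policy: fail if the scratch directory exceeds its quota,
    or if the filesystem holding it has less than min_free_bytes free.
    """
    used = dir_usage_bytes(scratch_dir)
    if used > quota_bytes:
        return f"scratch usage {used} bytes exceeds quota {quota_bytes} bytes"
    free = shutil.disk_usage(scratch_dir).free
    if free < min_free_bytes:
        return f"only {free} bytes free on the filesystem holding {scratch_dir}"
    return None
```

A supervisor could call `disk_problem()` every few seconds and, on a non-None result, cancel the container and record the returned string as the failure reason, so the failure is marked explicitly instead of surfacing as truncated logs. Polling can still lose a race against a fast writer, so a hard limit enforced by the filesystem or the container runtime would be more robust where available.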

Actions #1

Updated by Ward Vandewege about 3 years ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz almost 2 years ago

  • Release set to 60
Actions #3

Updated by Peter Amstutz 10 months ago

  • Target version set to Future