Bug #18262

[crunch-run] handle out-of-diskspace on the compute node better

Added by Ward Vandewege about 2 months ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

When a job consumes all available disk space on a compute node that was not started with a particular scratch space requirement (i.e. no extra partition was added), the job fills up the root partition of the node, which breaks the node in ways that are hard to diagnose.

In one example today, a workflow filled up the root partition (which was tiny). This caused /etc/resolv.conf to be emptied on the next DHCP renew (sigh), so crunch-run could no longer find the API server and the keepstores. The net effect was that the container failed with truncated logs, without explicitly being marked as failed. It looked as if crunch-run was crashing until we caught the compute node in the act, which was a bit of a debugging adventure.

Can we somehow restrict the amount of disk space the container is allowed to use?
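
As a starting point for discussion, here is a minimal sketch of a free-space check that could run next to the container and flag the problem before the root partition is completely full. This is not how crunch-run works today; crunch-run itself is written in Go, and the helper name `free_space_low` and the 1 GiB threshold are assumptions made for illustration only.

```python
import shutil


def free_space_low(path: str, min_free_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has fewer than
    `min_free_bytes` bytes free.

    Hypothetical helper for illustration, not part of crunch-run.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free < min_free_bytes


if __name__ == "__main__":
    # Assumed threshold: warn when less than 1 GiB is left on the root partition.
    threshold = 1 << 30
    if free_space_low("/", threshold):
        print("low disk space on /: container should be stopped and marked as failed")
    else:
        print("disk space on / is above the threshold")
```

A check like this only detects the condition. Actually restricting what the container can write would still need a quota or a dedicated scratch partition.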

History

#1 Updated by Ward Vandewege about 2 months ago

  • Description updated (diff)
