Bug #7444

closed

[Crunch] Docker container not removed when job canceled, filling disk

Added by Brett Smith over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: 2.0

Description

We use docker run --rm to ensure that Docker containers are removed after tasks are finished, to prevent compute nodes from filling up with unused volumes. However, docker run --rm is handled by the Docker client. It simply makes the necessary API calls to remove the container after it exits.

Crunch's cancel code kills the Docker client, so the client never makes those removal calls. If a user cancels a job, the container hangs around, along with its volumes. We just had a situation where compute nodes on a cluster filled their /tmp partitions because a user was canceling many jobs, leaving the nodes full of finished Docker containers and their large tmp volumes.

Make sure that when Crunch cancels a job, the corresponding Docker container is removed.

Implementation

  • Change crunch-job to stop using --rm, and name containers after the task. Append ".$try_number" to the name to avoid name collisions when tasks are retried.
  • Extend the Docker cleaner service to listen for container stop events, and immediately destroy those containers. Sysadmins who want to debug Docker on compute nodes are expected to stop the Docker cleaner service to do that.
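
The two steps above could be sketched roughly as follows. This is only an illustration, not the code that was merged: the function names, the STOPPED_STATUSES set, and the choice of the "die" event status are assumptions. The client object is assumed to behave like docker-py's low-level API client, which provides events(decode=True) and remove_container(container, v=True).

```python
def container_name(task_uuid, try_number):
    """Build a per-attempt container name: the task name plus '.<try number>'.

    Hypothetical helper; crunch-job itself is not written in Python.
    """
    return "{}.{}".format(task_uuid, try_number)


# Event statuses treated as "the container has stopped".
# Assumption: Docker reports "die" when a container's main process exits.
STOPPED_STATUSES = frozenset(["die"])


def handle_event(client, event):
    """Remove the container named in a stop event.

    Returns True if a removal was requested, False if the event was ignored.
    """
    if event.get("status") not in STOPPED_STATUSES:
        return False
    container_id = event.get("id")
    if not container_id:
        return False
    # v=True asks Docker to remove the container's volumes as well,
    # which is what keeps /tmp from filling up.
    client.remove_container(container_id, v=True)
    return True


def run_cleaner(client):
    """Listen for Docker events forever and clean up stopped containers."""
    for event in client.events(decode=True):
        handle_event(client, event)
```

Because the cleaner removes every container that stops, not only those started by Crunch, it has to be stopped before anyone can inspect a finished container on a compute node.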

Subtasks 5 (0 open, 5 closed)

Task #7560: Remove unused containers in dockercleaner (Resolved, Tom Clegg, 10/02/2015)
Task #7686: Remove --rm flag in crunch-job (Resolved, Tom Clegg, 10/02/2015)
Task #7693: Warn in install guide that dockercleaner will remove all stopped containers: think before installing, and stop it if you need to revive/debug containers (Resolved, Tom Clegg, 10/02/2015)
Task #7561: Testing (Resolved, Tom Clegg, 10/02/2015)
Task #7547: Review 7444-dockercleaner-containers (Resolved, Tom Clegg, 10/02/2015)