Bug #7444

Updated by Brett Smith over 8 years ago

We use @docker run --rm@ to ensure that Docker containers are removed after tasks finish, so that compute nodes do not fill up with unused volumes. However, "@docker run --rm@ is handled by the Docker client":https://github.com/docker/docker/issues/16575. The client simply makes the necessary API calls to remove the container after it exits.

Crunch's cancel code kills the Docker client, so the client never makes those removal calls. If a user cancels a job, the container hangs around, along with its volumes. We just had a situation where compute nodes on a cluster filled their @/tmp@ partitions because a user was canceling many jobs, leaving the nodes full of finished Docker containers and their large tmp volumes.

Make sure that when Crunch cancels a job, the corresponding Docker container is removed.

h2. Implementation

* Change crunch-job to stop using @--rm@, and name containers after the task. Append ".$try_number" to the name to avoid name collisions when tasks are retried.
* Extend the Docker cleaner service to listen for container stop events and immediately destroy those containers. Sysadmins who want to debug Docker on compute nodes are expected to stop the Docker cleaner service first.
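
The two steps above could be sketched roughly as follows. This is only an illustration, not the actual crunch-job or Docker cleaner code. The function names @container_name@ and @clean_stopped@ are hypothetical, and the sketch assumes that events arrive as one JSON object per line with @status@ and @id@ fields, and that a @die@ status marks a stopped container. The removal step is passed in as a callable so the logic can be exercised without a Docker daemon.

```python
import json
from typing import Callable, Iterable, List

# Assumption: a "die" status is what marks a container as stopped.
STOP_STATUSES = frozenset({"die"})


def container_name(task_uuid: str, try_number: int) -> str:
    """Build a per-attempt container name: '<task uuid>.<try number>'."""
    return "{}.{}".format(task_uuid, try_number)


def clean_stopped(event_lines: Iterable[str],
                  remove: Callable[[str], None]) -> List[str]:
    """Call remove(container_id) for each stop event in a JSON event stream.

    Returns the container IDs passed to remove, in the order seen.
    """
    removed = []
    for line in event_lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        event = json.loads(line)
        if event.get("status") in STOP_STATUSES and "id" in event:
            remove(event["id"])
            removed.append(event["id"])
    return removed
```

In a real service, @remove@ would presumably wrap something like @docker rm -v@ so that the container's volumes are deleted too, and @event_lines@ would come from the Docker daemon's event stream. Including the try number in the name means a retried task gets a fresh name even if the container from an earlier attempt has not been cleaned up yet.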
