Bug #7444

[Crunch] Docker container not removed when job canceled, filling disk

Added by Brett Smith over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 10/02/2015
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 2.0

Description

We use docker run --rm to ensure that Docker containers are removed after tasks are finished, to prevent compute nodes from filling up with unused volumes. However, docker run --rm is handled by the Docker client: the client simply makes the necessary API calls to remove the container after it exits.

Crunch's cancel code kills the Docker client, so when a user cancels a job, the client never makes those calls and the container hangs around, along with its volumes. We just had a situation where compute nodes on a cluster filled their /tmp partitions because a user was canceling many jobs, leaving the nodes full of finished Docker containers and their large tmp volumes.

Make sure that when Crunch cancels a job, the corresponding Docker container is removed.

Implementation

  • Extend crunch-job to stop using --rm, and name containers after the task. Append ".$try_number" to the name to avoid name collisions when tasks are retried.
  • Extend the Docker cleaner service to listen for container stop events, and immediately destroy those containers. Sysadmins who want to debug Docker on compute nodes are expected to stop the Docker cleaner service to do that.
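The second bullet can be sketched roughly as follows. This is only an illustration of the idea, not the actual dockercleaner code: the function name remove_on_stop is made up here, and it assumes a client object shaped like docker-py's low-level APIClient (a remove_container method, plus an events stream of decoded dicts in which a container exit has status "die").

```python
import logging

logger = logging.getLogger("dockercleaner_sketch")


def remove_on_stop(client, events):
    """Remove each container as soon as a "die" event reports that it exited.

    client: any object with remove_container(container_id, v=...). Assumed
            here to look like docker-py's APIClient; this is an assumption,
            not a statement about what dockercleaner really uses.
    events: an iterable of decoded Docker event dicts.
    Returns the list of container IDs that were removed.
    """
    removed = []
    for event in events:
        # Only container exits are interesting; skip create/start/etc.
        if event.get("status") != "die":
            continue
        container_id = event.get("id")
        if not container_id:
            continue
        try:
            # v=True asks Docker to remove the container's volumes too,
            # which is what was filling /tmp in this bug.
            client.remove_container(container_id, v=True)
        except Exception as error:
            # One failed removal should not stop the cleaner.
            logger.warning("could not remove %s: %s", container_id, error)
            continue
        removed.append(container_id)
    return removed
```

With docker-py installed, this would presumably be driven by something like remove_on_stop(client, client.events(decode=True)), where client is a docker.APIClient instance. Because the loop reacts to every exit event on the node, it also removes containers that did not come from Crunch, which is why sysadmins are expected to stop the service before debugging.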

Subtasks

Task #7560: Remove unused containers in dockercleaner (Resolved, Tom Clegg)

Task #7686: Remove --rm flag in crunch-job (Resolved, Tom Clegg)

Task #7693: Warn in install guide that dockercleaner will remove all stopped containers: think before installing, and stop it if you need to revive/debug containers (Resolved, Tom Clegg)

Task #7561: Testing (Resolved, Tom Clegg)

Task #7547: Review 7444-dockercleaner-containers (Resolved, Tom Clegg)

Associated revisions

Revision 1d1c6de3
Added by Tom Clegg over 4 years ago

Merge branch '7444-dockercleaner-containers' closes #7444

History

#1 Updated by Brett Smith over 4 years ago

  • Subject changed from [Crunch] Job containers not removed consistently, filling disk to [Crunch] Docker container not removed when job canceled, filling disk
  • Description updated (diff)

#2 Updated by Brett Smith over 4 years ago

  • Description updated (diff)
  • Story points set to 2.0

#3 Updated by Brett Smith over 4 years ago

  • Target version changed from Arvados Future Sprints to 2015-10-28 sprint

#4 Updated by Peter Amstutz over 4 years ago

  • Assigned To set to Peter Amstutz

#5 Updated by Brett Smith over 4 years ago

  • Target version changed from 2015-10-28 sprint to Arvados Future Sprints

#6 Updated by Tom Clegg over 4 years ago

  • Assigned To changed from Peter Amstutz to Tom Clegg
  • Target version changed from Arvados Future Sprints to 2015-11-11 sprint

#7 Updated by Tom Clegg over 4 years ago

Naming containers sounds like a good idea anyway, but seems tangential. Unless dockercleaner is supposed to pay attention to the names, perhaps in order to exempt non-Crunch containers from automatic removal...?

#8 Updated by Tom Clegg over 4 years ago

Should dockercleaner also delete all stopped containers that are already present when it starts up? This would help keep a long-running (e.g., bare metal) worker node clean.

If/when we do add this, I think it should have a separate command line flag, to support a workflow like:
  1. Turn off dockercleaner
  2. Run a job
  3. Turn on dockercleaner --leave-existing-containers
  4. Inspect the container left behind by the above job, but let subsequent jobs get cleaned up

Until then, there's "docker ps --filter status=exited --format {{.ID}} | xargs docker rm".
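For illustration, the startup cleanup discussed in this comment might look something like the sketch below. It is hypothetical: the function name and the leave_existing parameter (standing in for the proposed --leave-existing-containers flag) are invented here, and the client is again assumed to behave like docker-py's APIClient, whose containers() call accepts all=True and a status filter.

```python
def remove_exited_containers(client, leave_existing=False):
    """Remove every container that has already exited when the cleaner starts.

    leave_existing mirrors the proposed --leave-existing-containers flag:
    when True, containers already present are left alone so that one of
    them can be inspected, while later jobs are still cleaned up elsewhere.
    Returns the list of container IDs that were removed.
    """
    if leave_existing:
        return []
    removed = []
    # Equivalent in spirit to: docker ps --filter status=exited
    for container in client.containers(all=True, filters={"status": "exited"}):
        container_id = container["Id"]
        # v=True removes the container's volumes along with it.
        client.remove_container(container_id, v=True)
        removed.append(container_id)
    return removed
```

Running this once at startup, before entering the event loop, would keep a long-running worker node clean without changing how containers are handled afterwards.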

#9 Updated by Tom Clegg over 4 years ago

7444-dockercleaner-containers @ e10ccab

7444-no-docker-rm @ 07beca7

#10 Updated by Brett Smith over 4 years ago

Tom Clegg wrote:

> Naming containers sounds like a good idea anyway, but seems tangential.

You are right, it is not necessary for the dockercleaner changes. I previously had an implementation idea based on naming containers predictably and having crunch-job remove them. This is basically a remnant of that; there was still interest in naming as a debugging aid.

> Should dockercleaner also delete all stopped containers that are present when it starts up?

I'm interested in ops' opinion on this but my vote is yes.

#11 Updated by Tom Clegg over 4 years ago

Both changes (dockercleaner and crunch-job) are now in 7444-dockercleaner-containers at 9b48b17

#12 Updated by Tom Clegg over 4 years ago

  • Status changed from New to In Progress

#13 Updated by Nico César over 4 years ago

review 9b48b17eddea5e366e0c59ed9f3540793550256c

LGTM

#14 Updated by Tom Clegg over 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 80 to 100

Applied in changeset arvados|commit:1d1c6de3c842a33a57b7d469fdaaaa1b873433dc.
