Bug #3793

[Crunch] Docker containers not removed consistently, filling disk

Added by Brett Smith over 9 years ago. Updated almost 9 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: Crunch
Target version: 2015-05-20 sprint
Story points: 1.0

Description

When we start Docker containers for jobs, we pass the --rm=true option to Docker, which should remove the container after it exits. Unfortunately, Docker does not consistently remove these containers. Over time, more and more accumulate, until Docker's storage partition is full.

We haven't yet figured out any rhyme or reason to which containers aren't removed. It happens across several different images, and it happens to containers whether they finish successfully (exit status 0) or not.

We need to decide on and implement additional container clean-up strategies, either as a complement to or a replacement for --rm=true.

There's a new option (force-rm) that is not yet in a released version and that we should explore. It is expected to land in Docker 1.2.1.


Subtasks (5): 0 open, 5 closed

Task #6003: Review arvados-dev branch 3793-docker-cleaner-wip (Resolved, Ward Vandewege, 05/12/2015)
Task #5857: Review 3793-clean-docker-images-wip (Resolved, Peter Amstutz, 04/30/2015)
Task #5996: Test Docker image cleaner in run-tests.sh (Resolved, Brett Smith, 05/11/2015)
Task #5997: Deploy Docker image cleaner via Puppet (Resolved, Brett Smith, 05/11/2015)
Task #6056: Review puppet branch 3793-docker-image-cleaner-wip (Resolved, Ward Vandewege, 05/18/2015)
Actions #1

Updated by Brett Smith over 9 years ago

This is Docker bug 6757. It was fixed by adding a --force-rm option last month, so we may just need to upgrade and add that switch. (Edit: Maybe not, see note-5.)

Actions #2

Updated by Ward Vandewege over 9 years ago

  • Target version set to Arvados Future Sprints
Actions #3

Updated by Ward Vandewege over 9 years ago

  • Description updated (diff)
Actions #4

Updated by Tom Clegg over 9 years ago

  • Story points set to 1.0
Actions #5

Updated by Brett Smith over 9 years ago

A task in job qr1hi-8i9sb-98ermskay1azgri got this message:

2014/09/23 18:03:25 Error response from daemon: Cannot destroy container 9bc652e2d856e4f9919cced05e219b4e7ada04c18582edf5ddfb323b16d536ef: Driver aufs failed to remove root filesystem 9bc652e2d856e4f9919cced05e219b4e7ada04c18582edf5ddfb323b16d536ef: rename /tmp/docker/aufs/mnt/9bc652e2d856e4f9919cced05e219b4e7ada04c18582edf5ddfb323b16d536ef /tmp/docker/aufs/mnt/9bc652e2d856e4f9919cced05e219b4e7ada04c18582edf5ddfb323b16d536ef-removing: device or resource busy

The compute node in question has several exited, unremoved containers on it from around the same time. Given that and the target name, this definitely looks like it could be at least one source of the failure we're seeing. But why would Docker try to remove a busy layer? Race condition in the teardown process? A separate container still using it? Something else?

This looks like Docker bug 3968, which is currently open and still seeing activity.

Shortly after, the whole job was also bitten by #3433. Not sure if that's related, but it seemed worth noting, at least.

Actions #6

Updated by Ward Vandewege over 9 years ago

  • Target version changed from Arvados Future Sprints to 2014-10-29 sprint
Actions #7

Updated by Tom Clegg over 9 years ago

Is it worth trying a workaround, like running "docker ps -qa | xargs docker rm" during crunch-job's "clean temp before running any tasks" phase?

Actions #8

Updated by Brett Smith over 9 years ago

Tom Clegg wrote:

Is it worth trying a workaround, like running "docker ps -qa | xargs docker rm" during crunch-job's "clean temp before running any tasks" phase?

I think that depends on whether or not we decide to upgrade Docker to try to address some of the other bugs we've stumbled upon. If we're going to upgrade, I think we should do that first. If it doesn't provide an easy fix to this, then I agree that a workaround is probably the next logical step.

Actions #9

Updated by Tom Clegg over 9 years ago

  • Target version changed from 2014-10-29 sprint to Arvados Future Sprints
Actions #10

Updated by Tom Clegg about 9 years ago

  • Target version changed from Arvados Future Sprints to 2015-05-20 sprint
Actions #11

Updated by Tom Clegg about 9 years ago

https://forums.docker.com/t/command-to-remove-all-unused-images/20/5

Suggestion: cron job
  • Check `df`, do nothing if there's lots of free space
  • List docker images, sort by last use if possible, delete some (half?) of them with docker rmi (a rough sketch of this idea follows below)
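
A minimal sketch of this cron-job idea, for illustration only: the free-space threshold and the Docker storage path are assumptions, and since Docker doesn't report last-use times, image listing order stands in for "last used".

    #!/usr/bin/env python3
    # Hypothetical cron-driven cleanup: check free space, then prune the
    # older half of the Docker images. Threshold and path are illustrative.
    import shutil
    import subprocess

    FREE_BYTES_WANTED = 20 * (1 << 30)   # do nothing if this much is already free
    DOCKER_DIR = "/var/lib/docker"       # assumed Docker storage location

    def main():
        if shutil.disk_usage(DOCKER_DIR).free >= FREE_BYTES_WANTED:
            return  # plenty of space; nothing to do
        # `docker images -q` lists image IDs newest first, so the tail of the
        # list is the oldest (best available proxy for least recently used).
        image_ids = subprocess.check_output(["docker", "images", "-q"]).decode().split()
        for image_id in image_ids[len(image_ids) // 2:]:
            subprocess.call(["docker", "rmi", image_id])

    if __name__ == "__main__":
        main()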
Actions #12

Updated by Tom Clegg about 9 years ago

  • Target version changed from 2015-05-20 sprint to 2015-04-29 sprint
Actions #13

Updated by Brett Smith about 9 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith
Actions #14

Updated by Brett Smith almost 9 years ago

  • Target version changed from 2015-04-29 sprint to 2015-05-20 sprint
Actions #15

Updated by Brett Smith almost 9 years ago

3793-clean-docker-images-wip is up for review. A few notes:

  • THIS IS PYTHON 3. I asked Ward if this was okay and he didn't say no. python setup.py test will break. Try python3 setup.py test instead.
  • This is intended to run as a service rather than a cron job. In my personal experience, cron jobs are a bad match for tasks like this: either you don't run the job often enough, and your disk fills when you get slammed; or you waste a bunch of resources running the job too often. Listening and responding to Docker events lets us react more appropriately. It's also the only place to get "when was this image last used?" information from Docker. (A rough sketch of the event-driven approach follows after these notes.)
  • This service only removes images when there are zero containers in Docker. The rationale here is to avoid fighting for disk I/O with running compute jobs. However, it does mean that if the original bug reported in this issue (where --rm doesn't do what it says) persists, this service will never remove images.
    Docker bug 3968 suggests that this has been fixed in more recent Docker, and we should be upgrading to 1.5 anyway to support #5642. So I'm inclined to make this a deployment issue for now. If we upgrade to 1.5 and we're still having issues, then we can make the code smarter to listen to start/stop events and distinguish between created containers and running containers.
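
For illustration only, a rough sketch of the event-driven shape described above. This is not the branch code; the real cleaner also tracks per-image last-use times and a --keep space budget rather than removing everything. It assumes the docker-py 1.x client API (docker.Client, containers, images, remove_image, events):

    import json

    import docker
    from docker import errors

    def clean_if_idle(client):
        # Only touch images when there are zero containers, so we never
        # compete with running compute jobs for disk I/O.
        if client.containers(all=True):
            return
        for image in client.images():
            try:
                # The real cleaner applies an LRU/space-budget policy here.
                client.remove_image(image['Id'])
            except errors.APIError:
                pass  # image still referenced or already gone

    def main():
        client = docker.Client()
        for raw_event in client.events():  # blocks, yielding one JSON event at a time
            if isinstance(raw_event, bytes):
                raw_event = raw_event.decode('utf-8')
            event = json.loads(raw_event)
            if event.get('status') == 'die':  # a container just exited
                clean_if_idle(client)

    if __name__ == "__main__":
        main()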
Actions #16

Updated by Peter Amstutz almost 9 years ago

4dc84fd

  • Per office conversation, remove if self.images.any_users(): return from clean_images() since a node that is being used continuously (scheduled with overlapping jobs) may not have downtime between jobs that would allow the cleaner to delete stuff.
  • Request some comments in should_delete explaining the logic
  • What happens if an image has been loaded but has not yet been used? Is it a candidate for immediate deletion?
  • Can you track the timestamps when images are added to the node? Then the LRU logic could use max(added_at, used_at)
  • Suggest sorting on lru_images.sort(key=lambda image: -image.last_used) instead of sorting the wrong direction and having to reverse the list
  • Suggest combining the two for loops in should_delete():
            lru_images.sort(key=lambda image: -image.last_used)
            for image in lru_images:
                if image.size <= space_left:
                    space_left -= image.size
                else:
                    yield image.docker_id
    
  • Suggest some fallback logic like "always keep at least one image" to avoid breaking in the case where there is a single very large image that exceeds the limit. (A combined sketch of these suggestions follows after this list.)
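
For illustration, a rough combined sketch of the single-loop suggestion plus the keep-at-least-one fallback, assuming image objects with the last_used, size, and docker_id attributes used in the snippet above:

        def should_delete(images, space_left):
            # Keep the most recently used images while they fit in the space
            # budget, always keep at least one, and yield the rest for deletion.
            lru_images = sorted(images, key=lambda image: -image.last_used)
            for index, image in enumerate(lru_images):
                if index == 0 or image.size <= space_left:
                    space_left = max(space_left - image.size, 0)
                else:
                    yield image.docker_id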
Actions #17

Updated by Brett Smith almost 9 years ago

Peter Amstutz wrote:

4dc84fd

  • Per office conversation, remove if self.images.any_users(): return from clean_images() since a node that is being used continuously (scheduled with overlapping jobs) may not have downtime between jobs that would allow the cleaner to delete stuff.
  • Request some comments in should_delete explaining the logic

Both done.

  • Suggest sorting on lru_images.sort(key=lambda image: -image.last_used) instead of sorting the wrong direction and having to reverse the list
  • Suggest combining the two for loops in should_delete():

The two for loops iterate over the list in different directions. It finds images to save from the most recently used, but then yields candidates for deletion least recently used first. The current deletion loop isn't doing anything smart with that fact, but it seems worth keeping if we want to add smarts later like throttling deletions or something.

  • What happens if an image has been loaded but has not yet been used? Is it a candidate for immediate deletion?
  • Can you track the timestamps when images are added to the node? Then the LRU logic could use max(added_at, used_at)

Unfortunately, Docker does not make image load information available to clients. All the data provided by image inspection comes from the image itself and says nothing about its place in the system. And the only events reported for Docker images are untag and delete; nothing about loading or creation.

Because of that, the code only loads data about images at first boot (DockerImages.from_daemon) and when they're first used (DockerImageCleaner.new_container). This means that the main loop won't delete images immediately, because the only way it finds out about images is when they're in use. Fast deletion can happen to images loaded at boot if there's no event information for them, but there's not much we can do in that case anyway. Since Docker events are the only source of image use information, if the events are gone, we don't have a data source to make decisions from.

  • Suggest some fallback logic like "always keep at least one image" to avoid breaking in the case where there is a single very large image that exceeds the limit.

Avoid breaking what? Bearing in mind that we only find out about images when they're already in use, it's not clear to me that keeping a very large image around is actually the right policy—it could be eating into other users' scratch space.

Now at df5a73a. Thanks.

Actions #18

Updated by Peter Amstutz almost 9 years ago

Brett Smith wrote:

  • What happens if an image has been loaded but has not yet been used? Is it a candidate for immediate deletion?
  • Can you track the timestamps when images are added to the node? Then the LRU logic could use max(added_at, used_at)

Unfortunately, Docker does not make image load information available to clients. All the data provided by image inspection comes from the image itself and says nothing about its place in the system. And the only events reported for Docker images are untag and delete; nothing about loading or creation.

Because of that, the code only loads data about images at first boot (DockerImages.from_daemon) and when they're first used (DockerImageCleaner.new_container). This means that the main loop won't delete images immediately, because the only way it finds out about images is when they're in use. Fast deletion can happen to images loaded at boot if there's no event information for them, but there's not much we can do in that case anyway. Since Docker events are the only source of image use information, if the events are gone, we don't have a data source to make decisions from.

I see, that is unfortunate. So if I understand correctly, there is a potential race where an image is loaded (but not yet used), the cleaner starts, and something tries to use the image after the cleaner deletes the (unused) image. Once the cleaner is running, it only learns about images from container events, so it will not delete an image that has been loaded but not yet used; but then it is also possible to load up a bunch of images which don't get used and waste disk space, and the cleaner won't do anything about it.

The first case has a pretty small window if the cleaner is a system service that starts at boot and is running all the time.

The second case is plausible in the event of Crunch misbehaving, if it were to load images but fail to successfully run the container.

Neither of these cases is likely in "normal" operations.

So I agree with keeping the current behavior; however, I suggest adding some code to the event handlers that polls the images list and logs a list of the images (if any) that are not managed by the cleaner.

I'm trying to test it by just running it locally, and I'm getting a version mismatch:

$ python3 cleaner.py --keep 2gb
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/docker/client.py", line 120, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.4/dist-packages/requests/models.py", line 851, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cleaner.py", line 244, in <module>
    main(sys.argv[1:])
  File "cleaner.py", line 241, in main
    run(args, docker.Client())
  File "cleaner.py", line 228, in run
    images = DockerImages.from_daemon(args.keep, docker_client)
  File "cleaner.py", line 57, in from_daemon
    for image in docker_client.images():
  File "/usr/local/lib/python3.4/dist-packages/docker/client.py", line 636, in images
    True)
  File "/usr/local/lib/python3.4/dist-packages/docker/client.py", line 126, in _result
    self._raise_for_status(response)
  File "/usr/local/lib/python3.4/dist-packages/docker/client.py", line 122, in _raise_for_status
    raise errors.APIError(e, response, explanation=explanation)
docker.errors.APIError: 404 Client Error: Not Found ("b"client and server don't have same version (client : 1.18, server: 1.15)"")
$ docker version
Client version: 1.3.3
Client API version: 1.15
Go version (client): go1.3.3
Git commit (client): d344625
OS/Arch (client): linux/amd64
Server version: 1.3.3
Server API version: 1.15
Go version (server): go1.3.3
Git commit (server): d344625

(docker 1.3.3 seems to be the latest version available for Jessie)

Actions #19

Updated by Brett Smith almost 9 years ago

Peter Amstutz wrote:

I see, that is unfortunate. So if I understand correctly, there is a potential race where an image is loaded (but not yet used), the cleaner starts, and something tries to use the image after the cleaner deletes the (unused) image.

In general, crunch-job has a time-of-check versus time-of-use race condition around Docker images. At the start of a job, each node checks if it has the Docker image, and loads it if not. From then on, crunch-job assumes it can run containers in that image, without any error checking.

I realized this is the breakage you were alluding to earlier: we can't have a situation where a job with many tasks uses a large Docker image, and the cleaner removes it from a node between tasks. I have implemented your always-keep-one-image suggestion.

Once the cleaner is running, it only learns about images from container events, so it will not delete an image that has been loaded but not yet used; but then it is also possible to load up a bunch of images which don't get used and waste disk space, and the cleaner won't do anything about it.

The first case has a pretty small window if the cleaner is a system service that starts at boot and is running all the time.

That's the plan. When we deploy this, we could even order things to make sure this service is started in between the Docker daemon and SLURM. Done right, that should prevent the issue entirely.

The second case is plausible in the event of Crunch misbehaving, if it were to load images but fail to successfully run the container.

Neither of these cases is likely in "normal" operations.

It's also worth noting that the container just has to be created for the cleaner service to see it. So even buggy job specs (e.g., the Crunch script is missing or not executable) should not prevent the cleaner service from operating normally.

So I agree with keeping the current behavior; however, I suggest adding some code to the event handlers that polls the images list and logs a list of the images (if any) that are not managed by the cleaner.

Done. Now at b15e6c6.

I'm trying to test it by just running it locally, and I'm getting a version mismatch:

[...]

(docker 1.3.3 seems to be the latest version available for Jessie)

Docker isn't in jessie at all; it was removed at the maintainer's request due to the rapid release cycle.

We were already committed to using a newer version with Arvados anyway, to support features like --memory-swap. We just deployed Docker 1.6 and that's on apt.arvados.org. Please upgrade and try again.

Thanks.

Actions #20

Updated by Peter Amstutz almost 9 years ago

Minor comment: suggest renaming the command-line option from "--keep" to something else (maybe "--limit"), since "keep" is a special word in the Arvados world (even if dockercleaner doesn't do anything Arvados-specific).

I tried running it locally again:

$ python -marvados_docker.cleaner --keep 50G
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/peter/work/arvados/services/dockercleaner/arvados_docker/cleaner.py", line 260, in <module>
    main(sys.argv[1:])
  File "/home/peter/work/arvados/services/dockercleaner/arvados_docker/cleaner.py", line 257, in main
    run(args, docker.Client())
  File "/home/peter/work/arvados/services/dockercleaner/arvados_docker/cleaner.py", line 246, in run
    images, docker_client, docker_client.events(since=1, until=start_time))
TypeError: events() got an unexpected keyword argument 'since'
$ docker version
Client version: 1.6.0
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 4749651
OS/Arch (client): linux/amd64
Server version: 1.6.0
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 4749651
OS/Arch (server): linux/amd64
Actions #21

Updated by Peter Amstutz almost 9 years ago

docker_py-1.2.1-py3.4.egg

Actions #22

Updated by Peter Amstutz almost 9 years ago

My fault, I ran it with Python 2.7 instead of Python 3.

However, I note that cleaner.py has a #!/usr/bin/env python3 shebang line but isn't +x.

Actions #23

Updated by Peter Amstutz almost 9 years ago

Finally, I should note that the original story was about removing containers, not images. It may be necessary to have logic to clean up containers, because you can't delete images if a container still references them.

Actions #24

Updated by Nico César almost 9 years ago

Does adding this at the beginning of the script make sense?

if sys.version_info[0] < 3:
##exit gracefully...
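
A minimal version of such a guard (a sketch only; the message text is illustrative), kept free of Python-3-only syntax so that it reports the error under Python 2 instead of failing to parse:

    import sys

    if sys.version_info[0] < 3:
        sys.stderr.write("arvados_docker.cleaner requires Python 3\n")
        sys.exit(1)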

Actions #25

Updated by Peter Amstutz almost 9 years ago

I am running it now and it doesn't fail, but it doesn't appear to be doing anything either:

$ python3 -marvados_docker.cleaner --keep 1G

I have lots of images, but as far as I can tell, nothing has been deleted so far. It's possible that it is stuck iterating over the entire event history from the beginning of time. At a minimum, there needs to be more logging indicating what it is doing.

Actions #26

Updated by Peter Amstutz almost 9 years ago

Nico Cesar wrote:

Does adding this at the beginning of the script make sense?

> if sys.version_info[0] < 3:
>    ##exit gracefully...
> 

Yes, I think that's a good idea

Actions #27

Updated by Peter Amstutz almost 9 years ago

Here's the traceback after waiting ~5 minutes with no indication that it is doing anything, and then hitting ^C:

$ python3 -marvados_docker.cleaner --keep 1G
^CTraceback (most recent call last):
  File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/peter/work/arvados/services/dockercleaner/arvados_docker/cleaner.py", line 260, in <module>
    main(sys.argv[1:])
  File "/home/peter/work/arvados/services/dockercleaner/arvados_docker/cleaner.py", line 257, in main
    run(args, docker.Client())
  File "/home/peter/work/arvados/services/dockercleaner/arvados_docker/cleaner.py", line 252, in run
    cleaner.run()
  File "/home/peter/work/arvados/services/dockercleaner/arvados_docker/cleaner.py", line 149, in run
    for event in self.events:
  File "/usr/local/lib/python3.4/dist-packages/docker/client.py", line 194, in _stream_helper
    data = reader.read(1)
  File "/usr/local/lib/python3.4/dist-packages/requests/packages/urllib3/response.py", line 243, in read
    data = self._fp.read(amt)
  File "/usr/lib/python3.4/http/client.py", line 500, in read
    return super(HTTPResponse, self).read(amt)
  File "/usr/lib/python3.4/http/client.py", line 529, in readinto
    return self._readinto_chunked(b)
  File "/usr/lib/python3.4/http/client.py", line 614, in _readinto_chunked
    chunk_left = self._read_next_chunk_size()
  File "/usr/lib/python3.4/http/client.py", line 552, in _read_next_chunk_size
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib/python3.4/socket.py", line 374, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt
Actions #28

Updated by Nico César almost 9 years ago

I can help you with a tcpdump capture; maybe we can get better clues.

Actions #29

Updated by Brett Smith almost 9 years ago

I can change the executable bit and the flag name. Let me take a moment to address some of the points that have come up in the discussion, so we can settle on what else needs to be done to get this merged. Ward has been saying for over a week that this is blocking a deployment, so I want to do what I can to keep the number of review rounds down.

  • Containers vs. images: I agree, although to the extent that the original report is about compute nodes running out of space, images are the primary concern and cleaning up containers is more a prerequisite to cleaning up images. (I think I was a little too new to Docker to be clear about the difference when I filed this.) Per #3793-16, I'm hoping that our recent Docker upgrade will resolve the containers issue for us. If not, you're right that we'll have to make this smarter, but I think we at least have time to wait and see on this.
  • The test run: By default, the code only logs errors, so the lack of output in your test run is the behavior as implemented. Adding -v will provide more logging, up to -vvv for full debugging. If you want to bikeshed this, please let me know exactly what you want as far as defaults and command-line switches, so we can get this squared away quickly. The traceback indicates that it got to the point of waiting for a new Docker event, so it seems to be running okay. Is it possible that all your images are in use by stopped containers (docker ps -a)? That would prevent it from deleting images.
  • Version checking: This will only be effective as long as the code compiles in Python 2; i.e., it doesn't use any syntax unique to Python 3 like yield from. That happens to be true now, but I'm not sure it's worth the trade-off to burden future development with Python 2 compatibility for something that I think, ultimately, is just a transitional growing pain. If you strongly feel otherwise, though, I'll add it.

Thanks.

Actions #30

Updated by Peter Amstutz almost 9 years ago

On further testing, I see it working now. No further comments. Please merge #3793

Actions #31

Updated by Ward Vandewege almost 9 years ago

Reviewing 3793-docker-cleaner-wip in the arvados-dev repository:

  • I added a commit to fix a bug (the path to PKG-INFO was wrong)
  • The package failed to build for me because I didn't have python3-setuptools installed. It would be nice to add some testing for the presence of python3 and python3's setuptools. Is that hard to do?

Otherwise, LGTM

Actions #32

Updated by Brett Smith almost 9 years ago

Ward Vandewege wrote:

Reviewing 3793-docker-cleaner-wip in the arvados-dev repository:

  • I added a commit to fix a bug (the path to PKG-INFO was wrong)

Thanks for catching that.

  • The package failed to build for me because I didn't have python3-setuptools installed. It would be nice to add some testing for the presence of python3 and python3's setuptools. Is that hard to do?

Nope, we can just check that we can run easy_install3, just like we do with fpm. Added at commit:eecd977. Thanks.

Actions #33

Updated by Ward Vandewege almost 9 years ago

Thanks, LGTM!

Actions #34

Updated by Ward Vandewege almost 9 years ago

reviewing the puppet branch 3793-docker-image-cleaner-wip:

  • I think the compute_docker_image_quota default should be more than 20G, given the size of our scratch space (usually, hundreds of GiB). Maybe 50G is a better default.

Other than that, LGTM!

Actions #35

Updated by Brett Smith almost 9 years ago

  • Status changed from In Progress to Resolved

Ward Vandewege wrote:

reviewing the puppet branch 3793-docker-image-cleaner-wip:

  • I think the compute_docker_image_quota default should be more than 20G, given the size of our scratch space (usually, hundreds of GiB). Maybe 50G is a better default.

Other than that, LGTM!

Fixed up and merged to master. Thanks.
