Bug #5642 (closed): [Crunch] subprocesses get SIGKILL when they exceed Docker memory limit on compute nodes

Added by Brett Smith over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: Crunch
Target version: -
Story points: 0.5

Description

You can reproduce this outside Crunch, using any Docker image:

compute2.qr1hi:/home/brettadmin# docker run -ti --memory=4M <IMAGE> bash
crunch@e5291ea9db1c:/$ dash
$ x=x
$ seq 1 10000000 | while read n; do echo $n; x=$x$x; done
1
2
…
21
22
Killed
$ echo $?
137

Exit status 137 is 128 + 9, i.e. the process was killed with SIGKILL. We expected code to get access to swap after exceeding the memory limit, but that's clearly not happening on the compute nodes. One possible improvement would be to upgrade Docker: #5425-10 suggests this is observed behavior from older Docker, and that more recent versions handle it better. Our compute nodes are standardized on a release just before 1.2.


Subtasks 1 (0 open, 1 closed)

Task #5664: Review 5642-crunch-job-swap-limits-wip (Resolved, Peter Amstutz, 04/03/2015)
Actions #1

Updated by Brett Smith over 9 years ago

We tested out the original problem job with a Docker 1.5 package, and it failed the same way. An upgrade by itself will not solve the issue. We may actually need to change something in Crunch.

Actions #2

Updated by Brett Smith over 9 years ago

Did some more testing on my desktop. Even Docker 1.5 exhibits the kill behavior.

Using --memory-swap=-1 is documented, but doesn't work (this was apparently fixed recently in master but didn't make 1.5):

brinstar % docker run -ti --memory=4M --memory-swap=-1 arvados/base bash
docker: invalid size: '-1'. See 'docker run --help'.

But if you set --memory-swap to a real value, we get the desired behavior (note the loop runs a couple of iterations longer before being killed, once memory plus swap is exhausted):

brinstar % docker run -ti --memory=4M --memory-swap=32M arvados/base bash
root@bdcc762ea630:/# x=x
root@bdcc762ea630:/# seq 1 10000000 | while read n; do echo $n; x=$x$x; done
1
2
…
23
24
Killed
root@bdcc762ea630:/# exit

I think in order to get the behavior we really wanted from #5425, we should extend Crunch to introspect the amount of swap available and set --memory-swap to $MEMLIMIT plus that.
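
A rough shell sketch of that idea (illustration only, not the actual crunch-job code; MEMLIMIT_KB is a hypothetical per-task limit):

# Give the container its memory allotment plus all of the node's swap.
MEMLIMIT_KB=4096                                         # hypothetical per-task limit, in kB
SWAP_KB=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)  # total swap on this node, in kB
docker run -ti \
  --memory="${MEMLIMIT_KB}k" \
  --memory-swap="$((MEMLIMIT_KB + SWAP_KB))k" \
  arvados/base bash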

Actions #3

Updated by Brett Smith over 9 years ago

  • Status changed from New to In Progress
Actions #4

Updated by Brett Smith over 9 years ago

5642-crunch-job-swap-limits-wip is up for review to implement the fix suggested above. I tested it in Docker, with an arvados/compute image hacked to have Docker 1.5 installed. Interesting bit from the job log:

Running [/usr/bin/docker.io run --rm=true --attach=stdout --attach=stderr --attach=stdin -i --user=crunch --cidfile=/tmp/crunch-job/bcsdk-ot0gb-c3unybuy99qzpmg-0.cid --sig-proxy --memory=14793046k --memory-swap=31570258k --dns 10.0.42.1 …

This both gives all of RAM to the single task, and gives it access to all of swap as well.
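
As a rough sanity check of those numbers (assuming the test node had about 16 GiB of swap):

$ echo $((31570258 - 14793046))
16777212

16777212k is within a few kB of 16 GiB, i.e. the --memory-swap value above is the task's full-RAM --memory allotment plus the node's total swap.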

Actions #5

Updated by Brett Smith over 9 years ago

  • Story points set to 0.5
Actions #6

Updated by Peter Amstutz over 9 years ago

Debian Jessie has Docker 1.3.3, which does not have the --memory-swap flag. I'm not sure when it was introduced, but it seems to be after 1.3.

If a job has two nodes allocated and there are 2 tasks, it looks like it will still only allocate 50% memory to each task? (However, we don't control precisely which nodes the tasks are scheduled to, so maybe we can't assume that each task is placed on its own node.)

Actions #7

Updated by Brett Smith over 9 years ago

Peter Amstutz wrote:

Debian Jessie has Docker 1.3.3, which does not have the --memory-swap flag. I'm not sure when it was introduced, but it seems to be after 1.3.

I've been coordinating with Ward to do an upgrade to 1.5 as part of rolling this out, and I've been testing with packages he's prepared. Given the behavior we've observed, I'm not sure what other option we would have besides rolling back the memory limits.

If a job has two nodes allocated and there are 2 tasks, it looks like it will still only allocate 50% memory to each task? (However, we don't control precisely which nodes the tasks are scheduled to, so maybe we can't assume that each task is placed on its own node.)

Yeah, all of that is correct. crunch-job would have to get smarter about how it schedules work on slots for us to be able to be smarter about scaling the memory limit. This is just intended to grab the low-hanging fruit: one large task is a very common use case, so giving it all the RAM automatically seems worthwhile.

Thanks.
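
To make the even split described above concrete, a minimal sketch (hypothetical names only, assuming node RAM is divided evenly across the task slots scheduled on that node; this is not crunch-job's actual code):

NODE_RAM_KB=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)  # node RAM, in kB
SLOTS_ON_NODE=2                                             # e.g. two concurrent tasks on this node
MEMLIMIT_KB=$((NODE_RAM_KB / SLOTS_ON_NODE))                # each task's --memory is ~50% of RAM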

Actions #8

Updated by Peter Amstutz over 9 years ago

Brett Smith wrote:

Peter Amstutz wrote:

Debian Jessie has Docker 1.3.3, which does not have the --memory-swap flag. I'm not sure when it was introduced, but it seems to be after 1.3.

I've been coordinating with Ward to do an upgrade to 1.5 as part of rolling this out, and I've been testing with packages he's prepared. Given the behavior we've observed, I'm not sure what other option we would have besides rolling back the memory limits.

How hard would it be to include a version check? Now that Docker is starting to be included in Linux distributions, it's increasingly likely that users are going to be on older versions. Or we should explicitly decide that going forward we won't support Docker < 1.5.

If a job has two nodes allocated and there are 2 tasks, it looks like it will still only allocate 50% memory to each task? (However, we don't control precisely which nodes the tasks are scheduled to, so maybe we can't assume that each task is placed on its own node.)

Yeah, all of that is correct. crunch-job would have to get smarter about how it schedules work on slots for us to be able to be smarter about scaling the memory limit. This is just intended to grab the low-hanging fruit: one large task is a very common use case, so giving it all the RAM automatically seems worthwhile.

I agree, just wanted to check that I wasn't missing anything subtle that crunch-job was doing.

Actions #9

Updated by Brett Smith over 9 years ago

Peter Amstutz wrote:

Brett Smith wrote:

Peter Amstutz wrote:

Debian Jessie has Docker 1.3.3, which does not have the --memory-swap flag. I'm not sure when it was introduced, but it seems to be after 1.3.

I've been coordinating with Ward to do an upgrade to 1.5 as part of rolling this out, and I've been testing with packages he's prepared. Given the behavior we've observed, I'm not sure what other option we would have besides rolling back the memory limits.

How hard would it be to include a version check?

Not too bad. 71cbc63 expressly checks that Docker supports --memory-swap, and only sets --memory and --memory-swap when it does. Ready for another look.
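
A shell sketch of that kind of guard (illustration only; the real check is in crunch-job, and 71cbc63 may probe Docker differently):

MEMLIMIT_KB=4096                                         # hypothetical per-task limit, in kB
SWAP_KB=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)  # total swap on this node, in kB
# Only pass the memory flags when the installed Docker advertises --memory-swap.
if docker run --help 2>&1 | grep -q -- '--memory-swap'; then
  MEM_ARGS="--memory=${MEMLIMIT_KB}k --memory-swap=$((MEMLIMIT_KB + SWAP_KB))k"
else
  MEM_ARGS=""   # older Docker (e.g. 1.3.3): set neither flag, as described above
fi
docker run -ti $MEM_ARGS arvados/base bash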

Actions #10

Updated by Peter Amstutz over 9 years ago

LGTM

Actions #11

Updated by Brett Smith over 9 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:e009f03d9f3a620f375cf187291bd12c2bf1d74c.

Actions #12

Updated by Nico César over 9 years ago

same with docker 1.6.0


compute63.qr1hi:~# docker --version
Docker version 1.6.0, build 4749651
compute63.qr1hi:~# docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
<none>              <none>              d33416e64af4        13 days ago         1.09 GB
compute63.qr1hi:~# docker run -ti --memory=4M d33416e64af4 bash
root@a9679135aeb3:/# dash
# x=x
# seq 1 10000000 | while read n; do echo $n; x=$x$x; done
1
2
(..)
21
22
Killed
# echo $?
137

Actions #13

Updated by Brett Smith over 9 years ago

Nico César wrote:

same with docker 1.6.0

The branch sets --memory-swap so that the node's entire swap is available to the process, which avoids the issue. See the transcript in note-2, and the log snippet in note-4 showing Docker being invoked with that option by the patched crunch-job. Upgrading Docker was necessary to make the branch effective, because --memory-swap was not available in older versions.
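
For anyone checking a node by hand, one way to confirm whether a combined memory+swap ceiling was actually applied is to read the cgroup limits from inside the container (assuming the cgroup v1 memory controller with swap accounting enabled, as it must be for --memory-swap to take effect):

cat /sys/fs/cgroup/memory/memory.limit_in_bytes        # the --memory value, in bytes
cat /sys/fs/cgroup/memory/memory.memsw.limit_in_bytes  # the memory+swap ceiling, in bytes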
