Bug #5642
closed[Crunch] subprocesses get SIGKILL when they exceed Docker memory limit on compute nodes
Description
You can reproduce this outside Crunch, using any Docker image:
compute2.qr1hi:/home/brettadmin# docker run -ti --memory=4M <IMAGE> bash crunch@e5291ea9db1c:/$ dash $ x=x $ seq 1 10000000 | while read n; do echo $n; x=$x$x; done 1 2 … 21 22 Killed $ echo $? 137
We expected code to get access to swap after exceeding the memory limit, but that's clearly not happening on the compute nodes. One possible improvement would be to upgrade Docker. #5425-10 suggests that is observed behavior from older Docker, and that more recent Docker is better about this. Our compute nodes are standardized on just-pre-1.2.
Updated by Brett Smith over 9 years ago
We tested out the original problem job with a Docker 1.5 package, and it failed the same way. An upgrade by itself will not solve the issue. We may actually need to change something in Crunch.
Updated by Brett Smith over 9 years ago
Did some more testing on my desktop. Even Docker 1.5 exhibits the kill behavior.
Using --memory-swap=-1
is documented, but doesn't work (this was apparently fixed recently in master but didn't make 1.5):
brinstar % docker run -ti --memory=4M --memory-swap=-1 arvados/base bash docker: invalid size: '-1'. See 'docker run --help'.
But if you set --memory-swap to a real value, we get the desired behavior (note this loops longer):
brinstar % docker run -ti --memory=4M --memory-swap=32M arvados/base bash root@bdcc762ea630:/# x=x root@bdcc762ea630:/# seq 1 10000000 | while read n; do echo $n; x=$x$x; done 1 2 … 23 24 Killed root@bdcc762ea630:/# exit
I think in order to get the behavior we really wanted from #5425, we should extend Crunch to introspect the amount of swap available and set --memory-swap
to $MEMLIMIT
plus that.
Updated by Brett Smith over 9 years ago
5642-crunch-job-swap-limits-wip is up for review with to implement the fix suggested above. I tested it in Docker, with an arvados/compute image hacked to have Docker 1.5 installed. Interesting bit from the job log:
Running [/usr/bin/docker.io run --rm=true --attach=stdout --attach=stderr --attach=stdin -i --user=crunch --cidfile=/tmp/crunch-job/bcsdk-ot0gb-c3unybuy99qzpmg-0.cid --sig-proxy --memory=14793046k --memory-swap=31570258k --dns 10.0.42.1 …
This both gives all of RAM to the single task, and gives it access to all of swap as well.
Updated by Peter Amstutz over 9 years ago
Debian Jessie has Docker 1.3.3, which does not have the --memory-swap flag. I'm not sure when it was introduced, but it seems to be after 1.3.
If a job has two nodes allocated and there are 2 tasks, it looks like it will still only allocate 50% memory to each task? (However, we don't control precisely which nodes the tasks are scheduled to, so maybe we can't assume that each task is placed on its own node.)
Updated by Brett Smith over 9 years ago
Peter Amstutz wrote:
Debian Jessie has Docker 1.3.3, which does not have the --memory-swap flag. I'm not sure when it was introduced, but it seems to be after 1.3.
I've been coordinating with Ward to do an upgrade to 1.5 as part of rolling this out, and I've been testing with packages he's prepared. Given the behavior we've observed, I'm not sure what other option we would have besides rolling back the memory limits.
If a job has two nodes allocated and there are 2 tasks, it looks like it will still only allocate 50% memory to each task? (However, we don't control precisely which nodes the tasks are scheduled to, so maybe we can't assume that each task is placed on its own node.)
Yeah, all of that is correct. crunch-job would have to get smarter about how it schedules work on slots for us to be able to be smarter about scaling the memory limit. This is just intended to grab the low-hanging fruit: one large task is a very common use case, so giving it all the RAM automatically seems worthwhile.
Thanks.
Updated by Peter Amstutz over 9 years ago
Brett Smith wrote:
Peter Amstutz wrote:
Debian Jessie has Docker 1.3.3, which does not have the --memory-swap flag. I'm not sure when it was introduced, but it seems to be after 1.3.
I've been coordinating with Ward to do an upgrade to 1.5 as part of rolling this out, and I've been testing with packages he's prepared. Given the behavior we've observed, I'm not sure what other option we would have besides rolling back the memory limits.
How hard would it be to include a version check? Now that Docker is starting to get included into Linux distributions, it's increasingly likely that users are going to be on older versions. Or we should explicitly decide that going forward we won't support Docker < 1.5.
If a job has two nodes allocated and there are 2 tasks, it looks like it will still only allocate 50% memory to each task? (However, we don't control precisely which nodes the tasks are scheduled to, so maybe we can't assume that each task is placed on its own node.)
Yeah, all of that is correct. crunch-job would have to get smarter about how it schedules work on slots for us to be able to be smarter about scaling the memory limit. This is just intended to grab the low-hanging fruit: one large task is a very common use case, so giving it all the RAM automatically seems worthwhile.
I agree, just wanted to check that I wasn't missing anything subtle that crunch-job was doing.
Updated by Brett Smith over 9 years ago
Peter Amstutz wrote:
Brett Smith wrote:
Peter Amstutz wrote:
Debian Jessie has Docker 1.3.3, which does not have the --memory-swap flag. I'm not sure when it was introduced, but it seems to be after 1.3.
I've been coordinating with Ward to do an upgrade to 1.5 as part of rolling this out, and I've been testing with packages he's prepared. Given the behavior we've observed, I'm not sure what other option we would have besides rolling back the memory limits.
How hard would it be to include a version check?
Not too bad. 71cbc63 expressly checks that Docker supports --memory-swap
, and only sets --memory
and --memory-swap
when it does. Ready for another look.
Updated by Brett Smith over 9 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:e009f03d9f3a620f375cf187291bd12c2bf1d74c.
Updated by Nico César over 9 years ago
same with docker 1.6.0
compute63.qr1hi:~# docker --version Docker version 1.6.0, build 4749651 compute63.qr1hi:~# docker images REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE <none> <none> d33416e64af4 13 days ago 1.09 GB compute63.qr1hi:~# docker run -ti --memory=4M d33416e64af4 bash root@a9679135aeb3:/# dash # x=x # seq 1 10000000 | while read n; do echo $n; x=$x$x; done 1 2 (..) 21 22 Killed # echo $? 137
Updated by Brett Smith over 9 years ago
Nico Cesar wrote:
same with docker 1.6.0
The branch set --memory-swap
to make the node's entire swap available to the process, which avoids the issue. See the transcript in note-2, and the log snippet in note-3 that shows Docker being invoked with the option using the patched crunch-job. Upgrading Docker was necessary to make the branch effective, because --memory-swap
was not available in older versions.