Bug #5642: [Crunch] subprocesses get SIGKILL when they exceed Docker memory limit on compute nodes - Arvados

Actions

Copy link

Bug #5642

closed

[Crunch] subprocesses get SIGKILL when they exceed Docker memory limit on compute nodes

Added by Brett Smith about 10 years ago. Updated almost 10 years ago.

Status:

Resolved

Priority:

Normal

Assigned To:

Brett Smith

Category:

Crunch

Target version:

2015-04-29 sprint

Story points:

0.5

Description

You can reproduce this outside Crunch, using any Docker image:

compute2.qr1hi:/home/brettadmin# docker run -ti --memory=4M <IMAGE> bash
crunch@e5291ea9db1c:/$ dash
$ x=x
$ seq 1 10000000 | while read n; do echo $n; x=$x$x; done
1
2
…
21
22
Killed
$ echo $?
137

We expected code to get access to swap after exceeding the memory limit, but that's clearly not happening on the compute nodes. One possible improvement would be to upgrade Docker. #5425-10 suggests that is observed behavior from older Docker, and that more recent Docker is better about this. Our compute nodes are standardized on just-pre-1.2.

Subtasks 1 (0 open — 1 closed)

Actions

Copy link

Updated by Brett Smith about 10 years ago

We tested out the original problem job with a Docker 1.5 package, and it failed the same way. An upgrade by itself will not solve the issue. We may actually need to change something in Crunch.

Actions

Copy link

Updated by Brett Smith about 10 years ago

Did some more testing on my desktop. Even Docker 1.5 exhibits the kill behavior.

Using --memory-swap=-1 is documented, but doesn't work (this was apparently fixed recently in master but didn't make 1.5):

brinstar % docker run -ti --memory=4M --memory-swap=-1 arvados/base bash
docker: invalid size: '-1'. See 'docker run --help'.

But if you set --memory-swap to a real value, we get the desired behavior (note this loops longer):

brinstar % docker run -ti --memory=4M --memory-swap=32M arvados/base bash
root@bdcc762ea630:/# x=x
root@bdcc762ea630:/# seq 1 10000000 | while read n; do echo $n; x=$x$x; done
1
2
…
23
24
Killed
root@bdcc762ea630:/# exit

I think in order to get the behavior we really wanted from #5425, we should extend Crunch to introspect the amount of swap available and set --memory-swap to $MEMLIMIT plus that.

Actions

Copy link

Updated by Brett Smith about 10 years ago

Status changed from New to In Progress

Actions

Copy link

Updated by Brett Smith about 10 years ago

5642-crunch-job-swap-limits-wip is up for review with to implement the fix suggested above. I tested it in Docker, with an arvados/compute image hacked to have Docker 1.5 installed. Interesting bit from the job log:

Running [/usr/bin/docker.io run --rm=true --attach=stdout --attach=stderr --attach=stdin -i --user=crunch --cidfile=/tmp/crunch-job/bcsdk-ot0gb-c3unybuy99qzpmg-0.cid --sig-proxy --memory=14793046k --memory-swap=31570258k --dns 10.0.42.1 …

This both gives all of RAM to the single task, and gives it access to all of swap as well.

Actions

Copy link

Updated by Brett Smith almost 10 years ago

Story points set to 0.5

Actions

Copy link

Updated by Peter Amstutz almost 10 years ago

Debian Jessie has Docker 1.3.3, which does not have the --memory-swap flag. I'm not sure when it was introduced, but it seems to be after 1.3.

If a job has two nodes allocated and there are 2 tasks, it looks like it will still only allocate 50% memory to each task? (However, we don't control precisely which nodes the tasks are scheduled to, so maybe we can't assume that each task is placed on its own node.)

Actions

Copy link

Updated by Brett Smith almost 10 years ago

Peter Amstutz wrote:

Debian Jessie has Docker 1.3.3, which does not have the --memory-swap flag. I'm not sure when it was introduced, but it seems to be after 1.3.

I've been coordinating with Ward to do an upgrade to 1.5 as part of rolling this out, and I've been testing with packages he's prepared. Given the behavior we've observed, I'm not sure what other option we would have besides rolling back the memory limits.

If a job has two nodes allocated and there are 2 tasks, it looks like it will still only allocate 50% memory to each task? (However, we don't control precisely which nodes the tasks are scheduled to, so maybe we can't assume that each task is placed on its own node.)

Yeah, all of that is correct. crunch-job would have to get smarter about how it schedules work on slots for us to be able to be smarter about scaling the memory limit. This is just intended to grab the low-hanging fruit: one large task is a very common use case, so giving it all the RAM automatically seems worthwhile.

Thanks.

Actions

Copy link

Updated by Peter Amstutz almost 10 years ago

Brett Smith wrote:

Peter Amstutz wrote:

Debian Jessie has Docker 1.3.3, which does not have the --memory-swap flag. I'm not sure when it was introduced, but it seems to be after 1.3.

I've been coordinating with Ward to do an upgrade to 1.5 as part of rolling this out, and I've been testing with packages he's prepared. Given the behavior we've observed, I'm not sure what other option we would have besides rolling back the memory limits.

How hard would it be to include a version check? Now that Docker is starting to get included into Linux distributions, it's increasingly likely that users are going to be on older versions. Or we should explicitly decide that going forward we won't support Docker < 1.5.

If a job has two nodes allocated and there are 2 tasks, it looks like it will still only allocate 50% memory to each task? (However, we don't control precisely which nodes the tasks are scheduled to, so maybe we can't assume that each task is placed on its own node.)

Yeah, all of that is correct. crunch-job would have to get smarter about how it schedules work on slots for us to be able to be smarter about scaling the memory limit. This is just intended to grab the low-hanging fruit: one large task is a very common use case, so giving it all the RAM automatically seems worthwhile.

I agree, just wanted to check that I wasn't missing anything subtle that crunch-job was doing.

Actions

Copy link

Updated by Brett Smith almost 10 years ago

Peter Amstutz wrote:

Brett Smith wrote:

Peter Amstutz wrote:

Debian Jessie has Docker 1.3.3, which does not have the --memory-swap flag. I'm not sure when it was introduced, but it seems to be after 1.3.

I've been coordinating with Ward to do an upgrade to 1.5 as part of rolling this out, and I've been testing with packages he's prepared. Given the behavior we've observed, I'm not sure what other option we would have besides rolling back the memory limits.

How hard would it be to include a version check?

Not too bad. 71cbc63 expressly checks that Docker supports --memory-swap, and only sets --memory and --memory-swap when it does. Ready for another look.

Actions

Copy link

#10

Updated by Peter Amstutz almost 10 years ago

LGTM

Actions

Copy link

#11

Updated by Brett Smith almost 10 years ago

Status changed from In Progress to Resolved

Applied in changeset arvados|commit:e009f03d9f3a620f375cf187291bd12c2bf1d74c.

Actions

Copy link

#12

Updated by Nico César almost 10 years ago

same with docker 1.6.0


compute63.qr1hi:~# docker --version
Docker version 1.6.0, build 4749651
compute63.qr1hi:~# docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
<none>              <none>              d33416e64af4        13 days ago         1.09 GB
compute63.qr1hi:~# docker run -ti --memory=4M d33416e64af4 bash
root@a9679135aeb3:/# dash
# x=x
# seq 1 10000000 | while read n; do echo $n; x=$x$x; done
1
2
(..)
21
22
Killed
# echo $?
137

Actions

Copy link

#13

Updated by Brett Smith almost 10 years ago

Nico Cesar wrote:

same with docker 1.6.0

The branch set --memory-swap to make the node's entire swap available to the process, which avoids the issue. See the transcript in note-2, and the log snippet in note-3 that shows Docker being invoked with the option using the patched crunch-job. Upgrading Docker was necessary to make the branch effective, because --memory-swap was not available in older versions.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Arvados

Custom queries

Bug #5642

[Crunch] subprocesses get SIGKILL when they exceed Docker memory limit on compute nodes

Updated by Brett Smith about 10 years ago

Updated by Brett Smith about 10 years ago

Updated by Brett Smith about 10 years ago

Updated by Brett Smith about 10 years ago

Updated by Brett Smith almost 10 years ago

Updated by Peter Amstutz almost 10 years ago

Updated by Brett Smith almost 10 years ago

Updated by Peter Amstutz almost 10 years ago

Updated by Brett Smith almost 10 years ago

Updated by Peter Amstutz almost 10 years ago

Updated by Brett Smith almost 10 years ago

Updated by Nico César almost 10 years ago

Updated by Brett Smith almost 10 years ago