Bug #3824
closed
[Crunch] crunch-job should create task execution environment inside docker container, not on worker host.
Description
Cf. a5819ec1e48fba90658fcf676ffc50c1f216d484 and 0eb5711ade0f74e556b0a1c10909dbf0bdecb63f
Some refactoring of that section of crunch-job may be in order here.
Requirements:
- JOB_WORK and TASK_WORK must not be the same directory.
- JOB_WORK and TASK_WORK must exist when a crunch script starts.
- JOB_WORK and TASK_WORK (and other crunch-specific directories) are not required to exist in the docker image being used. (A design goal is that non-Arvados-aware docker images can be used to run jobs. Expecting special tmp directories won't fly. But we might have to make some reasonable assumptions like "docker image has a world-writable /tmp".)
- Run as much stuff as possible inside the docker container. E.g., the "build script" should run inside the container. It should have no trouble creating the required directories.
- Something like a5819ec1e48fba90658fcf676ffc50c1f216d484, but fix whatever failed to propagate log messages, and make sure to create CRUNCH_SRC in the container.
- docker run -a stdin -i seems to be necessary to get stdin through to the container. See #4044 notes.
- We probably want to use docker run --sig-proxy too.
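The first two requirements can be sketched as a small setup step run inside the container. This is only an illustration under stated assumptions: `setup_task_env` is a hypothetical helper, not something in crunch-job, and the demo uses a throwaway directory in place of the container's /tmp.

```shell
#!/bin/sh
# Hypothetical helper (not from crunch-job): prepare the two work
# directories inside the container before the crunch script starts.
setup_task_env() {
    # $1 = desired JOB_WORK path, $2 = desired TASK_WORK path
    if [ "$1" = "$2" ]; then
        echo "JOB_WORK and TASK_WORK must not be the same directory" >&2
        return 1
    fi
    # mkdir -p only needs a writable parent, such as a world-writable /tmp.
    mkdir -p "$1" "$2" || return 1
    JOB_WORK="$1"
    TASK_WORK="$2"
    export JOB_WORK TASK_WORK
}

# Demo against a throwaway directory standing in for the container's /tmp.
demo_base="$(mktemp -d)"
setup_task_env "$demo_base/work" "$demo_base/task/compute1.1"
echo "JOB_WORK=$JOB_WORK"
echo "TASK_WORK=$TASK_WORK"
```

Because the directories are created at run time, the image itself does not need to ship them.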
Updated by Ward Vandewege over 10 years ago
- Subject changed from Restore job temp directory to crunch job. to [Crunch] Restore job temp directory to crunch job.
Updated by Ward Vandewege over 10 years ago
- Subject changed from [Crunch] Restore job temp directory to crunch job. to [Crunch] Restore job temp directory to crunch-job.
Updated by Tom Clegg over 10 years ago
- Description updated (diff)
- Category set to Crunch
Updated by Tom Clegg over 10 years ago
- Subject changed from [Crunch] Restore job temp directory to crunch-job. to [Crunch] Create task execution environment inside docker container, not on worker host.
Updated by Tom Clegg about 10 years ago
- Subject changed from [Crunch] Create task execution environment inside docker container, not on worker host. to [Crunch] crunch-job should create task execution environment and run crunchstat inside docker container, not on worker host.
Updated by Ward Vandewege about 10 years ago
- Target version changed from Arvados Future Sprints to 2014-10-29 sprint
Updated by Tom Clegg about 10 years ago
- Subject changed from [Crunch] crunch-job should create task execution environment and run crunchstat inside docker container, not on worker host. to [Crunch] crunch-job should create task execution environment and collect stats inside docker container, not on worker host.
Updated by Tom Clegg about 10 years ago
- Subject changed from [Crunch] crunch-job should create task execution environment and collect stats inside docker container, not on worker host. to [Crunch] crunch-job should create task execution environment inside docker container, not on worker host.
Updated by Tom Clegg about 10 years ago
3824-crunch-container-setup @ 30bdc9b
- JOB_WORK and TASK_WORK are docker data volumes, instead of bind mounts to a temp dir on the host.
While testing this I made some other fixes to docker and crunch-job:
- Bigger tmpfs for keep volumes (big enough to store the arvados/jobs image)
- In the compute image, install docker.io from the arvados package repository (otherwise "docker.io not found")
- Run dnsmasq in the compute container (otherwise it can't run containerized jobs: crunch-job assumes the compute worker is a DNS cache)
- Log() isn't right in a crunch-job child process: just die().
- Show its stderr messages if the install script fails during task setup.
- On Workbench jobs#show page, do not print a confusing message ("there are jobs ahead of this one") if queue_position is nil.
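The difference between a host bind mount and a data volume shows up in the form of the --volume argument. The sketch below only prints the arguments and does not run docker; `docker_volume_args` is a hypothetical helper, and the example paths are taken from the log output later in this ticket.

```shell
#!/bin/sh
# Hypothetical helper: print the --volume arguments for one task.
# A "host:container:ro" value is a read-only bind mount of a host path;
# a bare container path asks docker for a data volume at that path.
docker_volume_args() {
    # $1 = host path of the Keep mount, $2 = JOB_WORK, $3 = TASK_WORK
    printf '%s\n' \
        "--volume=$1:/keep:ro" \
        "--volume=$2" \
        "--volume=$3"
}

docker_volume_args \
    /tmp/crunch-job/task/compute1.1.keep \
    /tmp/crunch-job/work \
    /tmp/crunch-job/task/compute1.1
```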
Updated by Brett Smith about 10 years ago
I'm working on building all the Docker infrastructure necessary to test this, but while I do, I wanted to provide my comments on a first pass of the diff at 30bdc9b.
- If you want to elucidate the comment about TASK_KEEPMOUNT: you may want to double-check me against the git log, but IIRC, the reason we use /keep is that when you specify a volume to put in the container, Docker will effectively mkdir but not mkdir -p the destination. So /keep was a destination that was likely to work and unlikely to conflict with other tools. The default destination wasn't usable because it's under TASK_WORK, which we didn't install inside the container (instead we've been setting a different TASK_WORK).
I'm not sure I understand your comment about tasks ignoring the value of TASK_KEEPMOUNT; the diff shows the previous code setting TASK_KEEPMOUNT=/keep.
- I know this made token stripping easier, but I'm a little sad that the srun debugging lost the quoting around multi-word arguments. It wasn't perfect, but I think it was much less likely to be ambiguous compared to what's in the branch.
- In shell_or_die, I'm not sure you can count on $! to stay useful as long as you do. I tested a little bit on my desktop:
if (system("false") != 0) {
    print "first system: $!\n";
    my $exitstatus = sprintf("exit %d signal %d", $? >> 8, $? & 0x7f);
    print "set exitstatus: $!\n";
    open(BCSERR, ">/tmp/perlerr.log") or die;
    print "open: $!\n";
    system("true");
    print "second system: $!\n";
    die "$exitstatus";
}
For me, $! starts empty, then becomes "Inappropriate ioctl for device" after the open call. I think if you want this information, you should probably stash it in $exitstatus.
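The same hazard exists in shell, where $? is overwritten by every command that runs. As a rough analogue of the "stash it right away" advice, written in shell rather than Perl (`run_and_report` is a hypothetical name used only for this illustration):

```shell
#!/bin/sh
# Hypothetical helper: run a command and keep its exit status usable
# even after later commands have overwritten $?.
run_and_report() {
    "$@"
    stashed=$?      # capture immediately, before any other command runs
    true            # stands in for later logging or cleanup; resets $? to 0
    later=$?
    echo "stashed=$stashed later=$later"
    return "$stashed"
}

run_and_report false || echo "command failed with status $?"
```

The stashed copy is the only reliable record once anything else has run, which is the point of saving the value into a variable before the open call in the Perl test.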
Updated by Brett Smith about 10 years ago
Brett Smith wrote:
I'm working on building all the Docker infrastructure necessary to test this, but while I do, I wanted to provide my comments on a first pass of the diff at 30bdc9b.
If I'm going to test this, I might need a hint on how to get into the Docker universe. I have it all built and started with arvdock, and it looks like things are running normally. But the only way I know to get an API token is to log in through Workbench. When I click the log in button, I land on a page at https://172.17.0.117/auth/joshid/callback?return_to=http%3A%2F%2Flocalhost%3A9899%2Fusers%2Fwelcomex%x%…, and it's a 500. It looks like #4296, except starting the login process over again doesn't help; I always get a 500.
Any hints? What are you doing on your setup between arvdock and submitting a job to the cluster?
Updated by Tom Clegg about 10 years ago
Brett Smith wrote:
I'm working on building all the Docker infrastructure necessary to test this, but while I do, I wanted to provide my comments on a first pass of the diff at 30bdc9b.
- If you want to elucidate the comment about TASK_KEEPMOUNT: you may want to double-check me against the git log, but IIRC, the reason we use /keep is that when you specify a volume to put in the container, Docker will effectively mkdir but not mkdir -p the destination. So /keep was a destination that was likely to work and unlikely to conflict with other tools. The default destination wasn't usable because it's under TASK_WORK, which we didn't install inside the container (instead we've been setting a different TASK_WORK).
I'm not sure I understand your comment about tasks ignoring the value of TASK_KEEPMOUNT; the diff shows the previous code setting TASK_KEEPMOUNT=/keep.
Ah. "Always mount it in /keep" seemed like an attempt to make it predictable for the benefit of someone who didn't want to bother looking at $TASK_KEEPMOUNT. I didn't want to get into the question of who has started hard-coding /keep instead of using $TASK_KEEPMOUNT now that it looks sort of predictable, even if that's not why it originally went that way. (I see run-command defaults to /keep if TASK_KEEPMOUNT is not set.)
I tried this (with Docker version 1.2.0-dev, build dc243c8) with success, so the mkdir -p issue, if any, seems fixed now:
# docker.io run -it --volume=/var/tmp:/foo/bar/baz/waz:ro ubuntu ls -la /foo/bar/baz/waz/
total 0
drwxrwxrwt 2 root root  6 Oct 28 05:17 .
drwxr-xr-x 3 root root 16 Oct 28 05:22 ..
It looks like the "use something outside /tmp/crunch-job" trend started (with /mnt) in 0eb77fba3f7de714a7edef1c57491f3c285f6d67, whose comment attributes it to uid mapping problems.
That makes sense for the *TMP directories, which have to be writable (although I think I've solved that simply by using data volumes instead of host mounts). I'm not sure what the uid mapping problem is for the keep mount, since it's read-only and we use --allow-other and have user_allow_other in /etc/fuse.conf...
I notice one read-only mount inside another is not so easy:
docker.io run -it --volume=/var/tmp:/foo/bar/baz/waz:ro --volume=/var:/foo/bar:ro ...
setup mount namespace creating new bind mount target mkdir /var/lib/docker/aufs/mnt/f24df0c3[...]/foo/bar/baz: [...]: read-only file system
But overlapping writable mounts seem ok.
- I know this made token stripping easier, but I'm a little sad that the srun debugging lost the quoting around multi-word arguments. It wasn't perfect, but I think it was much less likely to be ambiguous compared to what's in the branch.
Good point. I'd like to get better logging here because we're currently relying on crunchstat, of all things, to spit out the portion of the "start task" command that comes after crunchstat. I didn't like the way the existing quoting assumed (incorrectly) that there were no single quotes in any $args. New version:
my $show_cmd = join(" ", map {
if (/[\s\"]/) {
s/[\"\$\\]/\\$&/g;
"\"$_\"";
} else {
$_;
}} @{$args});
Theoretically \Q\E would be correct, but that makes it even less human-readable, and it's already bad enough. (By the time we get here, we've already done a bunch of \Q\E quoting, and it escapes stuff like [-:=/]...)
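For comparison, the usual shell convention for displaying an argument list unambiguously is to wrap each argument in single quotes and rewrite embedded single quotes. The sketch below is not what crunch-job does; `quote_arg` and `show_cmd` are hypothetical names, and the output is built as a new string so the original arguments are left untouched.

```shell
#!/bin/sh
# Hypothetical helper: single-quote one argument for display.
# Each embedded ' becomes '\'' (close quote, escaped quote, reopen quote).
# Limitation: trailing newlines in an argument are dropped by $(...).
quote_arg() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Hypothetical helper: join all arguments into one loggable line.
# The ( ... ) body runs in a subshell, so out/arg do not leak to the caller.
show_cmd() (
    out=""
    for arg in "$@"; do
        out="$out$(quote_arg "$arg") "
    done
    printf '%s\n' "${out% }"
)

show_cmd srun -n1 bash -c "echo it's a \"test\""
```

A line produced this way can be pasted back into a shell and yields the original arguments, at the cost of quoting even simple words.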
- In shell_or_die, I'm not sure you can count on $! to stay useful as long as you do. I tested a little bit on my desktop:
[...]
For me, $! starts empty, then becomes "Inappropriate ioctl for device" after the open call. I think if you want this information, you should probably stash it in $exitstatus.
Oops. I looked at man perlfunc and it seems $! is not interesting after system() anyway: "you can check all possible failure modes by inspecting $? like this:" ... so I removed that part of the dying breath.
Logging updates @ fb1bf9f
Updated by Tom Clegg about 10 years ago
Brett Smith wrote:
When I click the log in button, I land on a page at https://172.17.0.117/auth/joshid/callback?return_to=http%3A%2F%2Flocalhost%3A9899%2Fusers%2Fwelcomex%x%…, and it's a 500. It looks like #4296, except starting the login process over again doesn't help; I always get a 500.
Hm, I haven't hit that one. The main hurdle I've had is that I'm not running docker + browser on the same box, so 172.17.* don't work. After clicking "log in" I have to change the apiserver login url in my browser from https://172.17.*/login to the 192.168.*:9900 address that gets port-forwarded, and it works from there.
Another way: arvdock helpfully clobbers your existing ~/.config/arvados/settings.conf with root credentials and host settings for your docker api server, so "arv user current" should say you're root...
Updated by Brett Smith about 10 years ago
Tom Clegg wrote:
Brett Smith wrote:
- If you want to elucidate the comment about TASK_KEEPMOUNT: you may want to double-check me against the git log, but IIRC, the reason we use /keep is that when you specify a volume to put in the container, Docker will effectively mkdir but not mkdir -p the destination.
Ah. "Always mount it in /keep" seemed like an attempt to make it predictable for the benefit of someone who didn't want to bother looking at $TASK_KEEPMOUNT.
I realize it has that effect, but I never understood that as the motivation for any of the related changes that led us up to this point.
It looks like the "use something outside /tmp/crunch-job" trend started (with /mnt) in 0eb77fba3f7de714a7edef1c57491f3c285f6d67, whose comment attributes it to uid mapping problems.
That makes sense for the *TMP directories, which have to be writable (although I think I've solved that simply by using data volumes instead of host mounts). I'm not sure what the uid mapping problem is for the keep mount, since it's read-only and we use --allow-other and have user_allow_other in /etc/fuse.conf...
I think you already know this, but just to make sure we're on the same page, note that --allow-other doesn't override POSIX permissions. It's the other way around: if you don't allow others to read the mount, then they'll be forbidden no matter what the POSIX permissions say.
But that doesn't detract from your main point that UID mapping should never have been an issue for the FUSE mount, where everything's world-readable. Since the same commit changes both the temp directories and the FUSE mount, it looks to me like UID mapping was the motivation for the tempdir changes, and then moving the Keep mount was sort of a side effect of that.
I agree that the volume approach should make the UID mapping issue moot.
- I know this made token stripping easier, but I'm a little sad that the srun debugging lost the quoting around multi-word arguments. It wasn't perfect, but I think it was much less likely to be ambiguous compared to what's in the branch.
Good point. I'd like to get better logging here because we're currently relying on crunchstat, of all things, to spit out the portion of the "start task" command that comes after crunchstat. I didn't like the way the existing quoting assumed (incorrectly) that there were no single quotes in any $args. New version:
Would it make sense to just use Data::Dumper here? It seems like that's what we really want.
I looked at man perlfunc and it seems $! is not interesting after system() anyway: "you can check all possible failure modes by inspecting $? like this:" ... so I removed that part of the dying breath.
$? may report everything, but FWIW, $! does seem to have a useful string in failure modes beyond "child exited nonzero." For example, it says "No such file or directory" if you try to run something that doesn't exist. I'm fine with the current version, just sharing what I saw in my testing.
Updated by Brett Smith about 10 years ago
I got the hash job to run correctly inside a Docker container inside my Docker cluster. That's pretty awesome. Thanks for helping me through it. A couple of things this shook out from the branch:
- As discussed on IRC, the logging in srun can't mutate the strings in $args, lest it interfere with shell quoting or other semantics.
- There's no need to add our apt source in compute/Dockerfile, because that's already taken care of by arvados/base, which we inherit from.
And then some longer-standing issues in the Dockerfiles. I'm okay with seeing these addressed in the branch, or separately:
- Running apt-get update more than once per Dockerfile is redundant. I suggest trimming extraneous calls at least in the touched Dockerfiles.
- We should set permissions on /etc/fuse.conf when we add it to the compute image, because it needs to be world-readable and the builder's umask might not be.
Thanks.
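The /etc/fuse.conf point comes down to setting the mode explicitly instead of inheriting it from the builder's umask. A minimal sketch of that idea follows; `install_world_readable` is a hypothetical helper, and the demo writes to a temporary directory rather than /etc.

```shell
#!/bin/sh
# Hypothetical helper: copy a file and force mode 0644 so the result
# does not depend on the umask of whoever built or copied it.
install_world_readable() {
    # $1 = source file, $2 = destination
    cp "$1" "$2" && chmod 0644 "$2"
}

demo_dir="$(mktemp -d)"
(
    # Simulate a restrictive builder umask; without the chmod the copy
    # would end up readable by its owner only.
    umask 077
    echo "user_allow_other" > "$demo_dir/fuse.conf.src"
    install_world_readable "$demo_dir/fuse.conf.src" "$demo_dir/fuse.conf"
)
ls -l "$demo_dir/fuse.conf"
```

In a Dockerfile the equivalent would presumably be an explicit chmod step after the file is added.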
Updated by Tom Clegg about 10 years ago
- Target version changed from 2014-10-29 sprint to 2014-11-19 sprint
Updated by Tom Clegg about 10 years ago
Using Data::Dumper, the log message now looks like this:
Thu Oct 30 17:07:28 2014 2y486-8i9sb-i7pbmv8h0xd8efc 20198 1 stderr starting: ['srun','--nodelist=compute1','-n1','-c1','-N1','-D','/tmp','--job-name=2y486-8i9sb-i7pbmv8h0xd8efc.1.20477','bash','-c','if [ -e /tmp/crunch-job/task/compute1.1 ]; then rm -rf /tmp/crunch-job/task/compute1.1; fi; mkdir -p /tmp/crunch-job /tmp/crunch-job/work /tmp/crunch-job/task/compute1.1 /tmp/crunch-job/task/compute1.1.keep && cd /tmp/crunch-job && perl -&& exec arv-mount --by-id --allow-other /tmp/crunch-job/task/compute1.1.keep --exec crunchstat -cgroup-root=/sys/fs/cgroup -cgroup-parent=docker -cgroup-cid=/tmp/crunch-job/2y486-ot0gb-j601wipo850mzm3.cid -poll=10000 /usr/bin/docker.io run --rm=true --attach=stdout --attach=stderr --attach=stdin -i --user=crunch --cidfile=/tmp/crunch-job/2y486-ot0gb-j601wipo850mzm3.cid --sig-proxy $(ip -o address show scope global | gawk \'match($4, /^([0-9\\.:]+)\\//, x){print "--dns", x[1]}\') --volume=\\/tmp\\/crunch\\-job\\/src\\:\\/tmp\\/crunch\\-job\\/src\\:ro --volume=\\/tmp\\/crunch\\-job\\/opt\\:\\/tmp\\/crunch\\-job\\/opt\\:ro --volume=\\/tmp\\/crunch\\-job\\/task\\/compute1\\.1\\.keep\\:\\/keep\\:ro --volume=\\/tmp\\/crunch\\-job\\/task\\/compute1\\.1 --volume=\\/tmp\\/crunch\\-job\\/work --env=CRUNCH_JOB_BIN\\=\\/usr\\/src\\/arvados\\/services\\/crunch\\/crunch\\-job --env=TASK_SEQUENCE\\=1 --env=TASK_KEEPMOUNT\\=\\/keep --env=CRUNCH_SRC_COMMIT\\=e2fe6c0e5c1c62a37e03519590c04a5186a2cc9b --env=TASK_QSEQUENCE\\=1 --env=CRUNCH_INSTALL\\=\\/tmp\\/crunch\\-job\\/opt --env=CRUNCH_REFRESH_TRIGGER\\=\\/tmp\\/crunch_refresh_trigger --env=ARVADOS_API_TOKEN\\=[...] 
--env=CRUNCH_WORK\\=\\/tmp\\/crunch\\-job\\/work --env=CRUNCH_TMP\\=\\/tmp\\/crunch\\-job --env=TASK_TMPDIR\\=\\/tmp\\/crunch\\-job\\/task\\/compute1\\.1 --env=JOB_UUID\\=2y486\\-8i9sb\\-i7pbmv8h0xd8efc --env=CRUNCH_JOB_UUID\\=2y486\\-8i9sb\\-i7pbmv8h0xd8efc --env=TASK_SLOT_NUMBER\\=1 --env=CRUNCH_SRC_URL\\=\\/var\\/lib\\/arvados\\/internal\\.git --env=TASK_SLOT_NODE\\=compute1 --env=JOB_PARAMETER_INPUT\\=83367e8913dcec0bf3fc25ed5a27eacb\\+49 --env=ARVADOS_API_HOST_INSECURE\\=yes --env=JOB_SCRIPT\\=hash --env=CRUNCH_NODE_SLOTS\\=1 --env=TASK_WORK\\=\\/tmp\\/crunch\\-job\\/task\\/compute1\\.1 --env=ARVADOS_API_HOST\\=api --env=JOB_WORK\\=\\/tmp\\/crunch\\-job\\/work --env=TASK_UUID\\=2y486\\-ot0gb\\-j601wipo850mzm3 --env=CRUNCH_SRC\\=\\/tmp\\/crunch\\-job\\/src --env=HOME\\=\\/tmp\\/crunch\\-job\\/task\\/compute1\\.1 67efcf13b3cc5e240cabf938e7d7c72fb033425d2e9b4b1d65e3a1e96e6bdaad stdbuf --output=0 --error=0 /tmp/crunch-job/src/crunch_scripts/hash']
Removed the redundant "add arvados repo" and just added docker.io to the existing apt-get install command instead.
Brought back $! (safely, via my $err).
Now at da6674e
Updated by Tom Clegg about 10 years ago
Brett Smith wrote:
And then some longer-standing issues in the Dockerfiles. I'm okay with seeing these addressed in the branch, or separately:
- Running apt-get update more than once per Dockerfile is redundant. I suggest trimming extraneous calls at least in the touched Dockerfiles.
Hm, they're redundant if you build all of the images at once, but I'm not confident we should assume that. Would "apt-get update unless /var/cache/apt/pkgcache.bin is <1 hour old" be better? (Trying to choose an ideal threshold mostly makes me inclined to err on the side of caution and leave apt-get update in there for the sake of predictability.)
Added #4366 so we can deliberate further. :)
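The "unless pkgcache.bin is <1 hour old" idea could be expressed roughly as below. This is a sketch only: `cache_is_fresh` is a hypothetical helper, the 60-minute threshold is the one floated above, and the demo checks temporary files instead of touching the real apt cache. It relies on find's -mmin test, which GNU and BSD find both provide.

```shell
#!/bin/sh
# Hypothetical helper: succeed when file $1 exists and was modified
# less than $2 minutes ago.
cache_is_fresh() {
    [ -n "$(find "$1" -mmin "-$2" 2>/dev/null)" ]
}

# Demo with stand-in files: one just created, one dated back to 2000.
demo_dir="$(mktemp -d)"
touch "$demo_dir/fresh.bin"
touch -t 200001010000 "$demo_dir/stale.bin"

for f in "$demo_dir/fresh.bin" "$demo_dir/stale.bin"; do
    if cache_is_fresh "$f" 60; then
        echo "$f: recent enough, would skip apt-get update"
    else
        echo "$f: too old, would run apt-get update"
    fi
done
```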
- We should set permissions on /etc/fuse.conf when we add it to the compute image, because it needs to be world-readable and the builder's umask might not be.
Done in 1bcfc05
Updated by Tom Clegg about 10 years ago
Tom Clegg wrote:
Brett Smith wrote:
And then some longer-standing issues in the Dockerfiles. I'm okay with seeing these addressed in the branch, or separately:
- Running apt-get update more than once per Dockerfile is redundant. I suggest trimming extraneous calls at least in the touched Dockerfiles.
Hm, they're redundant if you build all of the images at once, but I'm not confident we should assume that. Would "apt-get update unless /var/cache/apt/pkgcache.bin is <1 hour old" be better? (Trying to choose an ideal threshold mostly makes me inclined to err on the side of caution and leave apt-get update in there for the sake of predictability.)
Added #4366 so we can deliberate further. :)
Sorry, I misread your comment. Removed redundant "apt-get update"s from Dockerfiles that had more than one (base and compute). Also went through (most of) the Dockerfiles to make them use "apt-get update -qq" and "apt-get install -qqy" for consistency and greppability. (Greppibility?)
Removed #4366.
Also, added GPG key to base image so latest RVM can build -- see https://github.com/wayneeseguin/rvm/commit/7386864bc5e47de1d5cbbd339f9e008d0811c181
Now at c199c0c
Updated by Brett Smith about 10 years ago
Reviewing c199c0c, and my only concern is the TASK_KEEPMOUNT comment. Looking at it again, I feel more strongly about it: crunch-job has always set TASK_KEEPMOUNT correctly, and Crunch authors have consistently been told to use it. Any scripts that do otherwise are buggy, and I worry that the comment might give future readers the idea that this is a corner case crunch-job must support in perpetuity. Everything else looks good to me.
Updated by Tom Clegg about 10 years ago
Brett Smith wrote:
Reviewing c199c0c, and my only concern is the TASK_KEEPMOUNT comment. Looking at it again, I feel more strongly about it: crunch-job has always set TASK_KEEPMOUNT correctly, and Crunch authors have consistently been told to use it. Any scripts that do otherwise are buggy, and I worry that the comment might give future readers the idea that this is a corner case crunch-job must support in perpetuity. Everything else looks good to me.
Indeed, when I wrote that comment I meant to mock the idea of making the path predictable, but I see now that the sarcasm is unhelpful. I've fixed the comment so it says clearly (I think!) to do the right thing. b39e2b4
+ # Currently, we make arv-mount's mount point appear at /keep
+ # inside the container (instead of using the same path as the
+ # host like we do with CRUNCH_SRC and CRUNCH_INSTALL). However,
+ # crunch scripts and utilities must not rely on this. They must
+ # use $TASK_KEEPMOUNT.
$command .= "--volume=\Q$ENV{TASK_KEEPMOUNT}:/keep:ro\E ";
$ENV{TASK_KEEPMOUNT} = "/keep";
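From the crunch script's side, following that comment means building Keep paths from $TASK_KEEPMOUNT rather than from a literal /keep. A minimal sketch, assuming a hypothetical `keep_path` helper that refuses to guess when the variable is missing; the collection locator is the one shown in the log above and the file name is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: turn a path relative to the Keep mount into an
# absolute path, using whatever mount point crunch-job advertised.
keep_path() {
    # Fail loudly instead of assuming /keep.
    if [ -z "${TASK_KEEPMOUNT:-}" ]; then
        echo "TASK_KEEPMOUNT is not set" >&2
        return 1
    fi
    printf '%s/%s\n' "$TASK_KEEPMOUNT" "$1"
}

# Demo: crunch-job would normally export this; input.txt is a made-up name.
TASK_KEEPMOUNT=/keep
export TASK_KEEPMOUNT
keep_path "83367e8913dcec0bf3fc25ed5a27eacb+49/input.txt"
```

Failing on an unset variable is stricter than falling back to /keep; it is one possible design choice, picked here so that a wrong mount point is noticed early.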
Updated by Anonymous about 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 78 to 100
Applied in changeset arvados|commit:c6a03a7abff947dc8242e8be18b4b5e6920a3e4a.
Updated by Tom Clegg about 10 years ago
- Status changed from Resolved to In Progress
Updated by Peter Amstutz about 10 years ago
Reviewing 1137425
I still can't run a job inside Docker:
Mon Nov 17 10:34:22 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 check slurm allocation Mon Nov 17 10:34:22 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 node localhost - 4 slots Mon Nov 17 10:34:22 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 start Mon Nov 17 10:34:22 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 Clean work dirs Mon Nov 17 10:34:22 2014 starting: ['bash','-c','if mount | grep -q $JOB_WORK/; then for i in $JOB_WORK/*keep $CRUNCH_TMP/task/*.keep; do /bin/fusermount -z -u $i; done; fi; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src*'] Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 Cleanup command exited 0 Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 Looking for version 3a31350c6265cb1135d3d4d40af436aae91a9894 from repository arvados Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 Using local repository '/home/peter/work/_arvados_internal.git' Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 Version 3a31350c6265cb1135d3d4d40af436aae91a9894 is commit 3a31350c6265cb1135d3d4d40af436aae91a9894 Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 Run install script on all workers Mon Nov 17 10:34:25 2014 starting: ['sh','-c','mkdir -p /tmp/crunch-job-1001/opt && cd /tmp/crunch-job-1001 && perl -'] Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 Install script exited 0 Mon Nov 17 10:34:25 2014 starting: ['/bin/sh','-ec',' if ! 
/usr/bin/docker.io images -q --no-trunc | grep -qxF 777ef687a8811f22fdd7c615be9356a92ce5f2150ff481bd368def31eae1bc15; then arv-get c04222e796767ae26d1096c7717162d6\\+1134\\/777ef687a8811f22fdd7c615be9356a92ce5f2150ff481bd368def31eae1bc15\\.tar | /usr/bin/docker.io load fi '] Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 script run-command Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 script_version 3a31350c6265cb1135d3d4d40af436aae91a9894 Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 script_parameters {"task.stdout":"foo.txt","command":[["ls"]]} Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 runtime_constraints {"max_tasks_per_node":0,"docker_image":"arvados/jobs"} Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 start level 0 Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 status: 0 done, 0 running, 1 todo Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 job_task 4n8aq-ot0gb-k613z8mfcng66kl Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 child 21511 started on localhost.1 Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr starting: ['bash','-c','if [ -e /tmp/crunch-job-1001/task/localhost.1 ]; then rm -rf /tmp/crunch-job-1001/task/localhost.1; fi; mkdir -p /tmp/crunch-job-1001 /tmp/crunch-job-1001/work /tmp/crunch-job-1001/task/localhost.1 /tmp/crunch-job-1001/task/localhost.1.keep && cd /tmp/crunch-job-1001 && exec arv-mount --by-id --allow-other /tmp/crunch-job-1001/task/localhost.1.keep --exec crunchstat -cgroup-root=/sys/fs/cgroup -cgroup-parent=docker -cgroup-cid=/tmp/crunch-job-1001/4n8aq-ot0gb-k613z8mfcng66kl.cid -poll=10000 /usr/bin/docker.io run --rm=true --attach=stdout --attach=stderr --attach=stdin -i --user=crunch --cidfile=/tmp/crunch-job-1001/4n8aq-ot0gb-k613z8mfcng66kl.cid --sig-proxy $(ip -o address show scope global | gawk \'match($4, /^([0-9\\.:]+)\\//, x){print "--dns", x[1]}\') 
--volume=\\/tmp\\/crunch\\-job\\-1001\\/src\\:\\/tmp\\/crunch\\-job\\-1001\\/src\\:ro --volume=\\/tmp\\/crunch\\-job\\-1001\\/opt\\:\\/tmp\\/crunch\\-job\\-1001\\/opt\\:ro --volume=\\/tmp\\/crunch\\-job\\-1001\\/task\\/localhost\\.1\\.keep\\:\\/keep\\:ro --volume=/tmp --env=CRUNCH_TMP\\=\\/tmp\\/crunch\\-job\\-1001 --env=ARVADOS_API_HOST_INSECURE\\=true --env=TASK_TMPDIR\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/localhost\\.1 --env=TASK_QSEQUENCE\\=0 --env=CRUNCH_SRC_COMMIT\\=3a31350c6265cb1135d3d4d40af436aae91a9894 --env=CRUNCH_JOB_BIN\\=\\/home\\/peter\\/work\\/arvados\\/services\\/crunch\\/crunch\\-job --env=CRUNCH_NODE_SLOTS\\=4 --env=ARVADOS_API_TOKEN\\=[...] --env=JOB_SCRIPT\\=run\\-command --env=TASK_KEEPMOUNT\\=\\/keep --env=CRUNCH_INSTALL\\=\\/tmp\\/crunch\\-job\\-1001\\/opt --env=TASK_WORK\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/localhost\\.1 --env=JOB_PARAMETER_COMMAND\\=ARRAY\\(0x1e0b2a8\\) --env=TASK_SEQUENCE\\=0 --env=CRUNCH_SRC\\=\\/tmp\\/crunch\\-job\\-1001\\/src --env=JOB_PARAMETER_TASK\\.STDOUT\\=foo\\.txt --env=CRUNCH_WORK\\=\\/tmp\\/crunch\\-job\\-1001\\/work --env=TASK_SLOT_NUMBER\\=1 --env=CRUNCH_JOB_UUID\\=4n8aq\\-8i9sb\\-bwrsi9zltvbut0t --env=TASK_UUID\\=4n8aq\\-ot0gb\\-k613z8mfcng66kl --env=ARVADOS_API_HOST\\=petere1\\:3001 --env=TASK_SLOT_NODE\\=localhost --env=CRUNCH_REFRESH_TRIGGER\\=\\/tmp\\/crunch_refresh_trigger --env=CRUNCH_SRC_URL\\=\\/home\\/peter\\/work\\/_arvados_internal\\.git --env=JOB_UUID\\=4n8aq\\-8i9sb\\-bwrsi9zltvbut0t --env=JOB_WORK\\=\\/tmp\\/crunch\\-job\\-work --env=HOME\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/localhost\\.1 777ef687a8811f22fdd7c615be9356a92ce5f2150ff481bd368def31eae1bc15 stdbuf --output=0 --error=0 perl - /tmp/crunch-job-1001/src/crunch_scripts/run-command'] Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 status: 0 done, 1 running, 0 todo Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: Running [/usr/bin/docker.io run --rm=true --attach=stdout --attach=stderr 
--attach=stdin -i --user=crunch --cidfile=/tmp/crunch-job-1001/4n8aq-ot0gb-k613z8mfcng66kl.cid --sig-proxy --dns 10.13.4.125 --dns 172.17.42.1 --volume=/tmp/crunch-job-1001/src:/tmp/crunch-job-1001/src:ro --volume=/tmp/crunch-job-1001/opt:/tmp/crunch-job-1001/opt:ro --volume=/tmp/crunch-job-1001/task/localhost.1.keep:/keep:ro --volume=/tmp --env=CRUNCH_TMP=/tmp/crunch-job-1001 --env=ARVADOS_API_HOST_INSECURE=true --env=TASK_TMPDIR=/tmp/crunch-job-task-work/localhost.1 --env=TASK_QSEQUENCE=0 --env=CRUNCH_SRC_COMMIT=3a31350c6265cb1135d3d4d40af436aae91a9894 --env=CRUNCH_JOB_BIN=/home/peter/work/arvados/services/crunch/crunch-job --env=CRUNCH_NODE_SLOTS=4 --env=ARVADOS_API_TOKEN=4pf5q524ay4l7a47269nnml3wjfq4qtiiqqj1dd4nj46s00ane --env=JOB_SCRIPT=run-command --env=TASK_KEEPMOUNT=/keep --env=CRUNCH_INSTALL=/tmp/crunch-job-1001/opt --env=TASK_WORK=/tmp/crunch-job-task-work/localhost.1 --env=JOB_PARAMETER_COMMAND=ARRAY(0x1e0b2a8) --env=TASK_SEQUENCE=0 --env=CRUNCH_SRC=/tmp/crunch-job-1001/src --env=JOB_PARAMETER_TASK.STDOUT=foo.txt --env=CRUNCH_WORK=/tmp/crunch-job-1001/work --env=TASK_SLOT_NUMBER=1 --env=CRUNCH_JOB_UUID=4n8aq-8i9sb-bwrsi9zltvbut0t --env=TASK_UUID=4n8aq-ot0gb-k613z8mfcng66kl --env=ARVADOS_API_HOST=petere1:3001 --env=TASK_SLOT_NODE=localhost --env=CRUNCH_REFRESH_TRIGGER=/tmp/crunch_refresh_trigger --env=CRUNCH_SRC_URL=/home/peter/work/_arvados_internal.git --env=JOB_UUID=4n8aq-8i9sb-bwrsi9zltvbut0t --env=JOB_WORK=/tmp/crunch-job-work --env=HOME=/tmp/crunch-job-task-work/localhost.1 777ef687a8811f22fdd7c615be9356a92ce5f2150ff481bd368def31eae1bc15 stdbuf --output=0 --error=0 perl - /tmp/crunch-job-1001/src/crunch_scripts/run-command] Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: reading stats from /sys/fs/cgroup/memory/memory.stat Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: mem 8831631360 cache 29884416 swap 11093 pgmajfault 5604982784 rss Mon Nov 17 10:34:25 2014 
4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuacct/cpuacct.stat Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuset/cpuset.cpus Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: cpu 305513.6700 user 41036.8100 sys 4 cpus Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: reading stats from /sys/fs/cgroup/blkio/blkio.io_service_bytes Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: blkio:8:32 11837975040 write 3981105152 read Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: blkio:8:16 3230437376 write 1393857536 read Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: blkio:8:0 138842447872 write 5431453696 read Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuacct/cgroup.procs Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: net:docker0 167907635 tx 70811882 rx Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: net:eth1 3221561126 tx 9785928672 rx Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr /tmp/crunch-job-1001/src.lock: Permission denied at - line 26. 
Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 child 21511 on localhost.1 exit 13 success= Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 failure (#1, permanent) after 2 seconds Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 output Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 Every node has failed -- giving up on this round Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 wait for last 0 children to finish Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 status: 0 done, 0 running, 1 todo Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 Freeze not implemented Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 collate Mon Nov 17 10:34:27 2014 Collection saved as 'Saved at 2014-11-17 15:34:23 UTC by peter@peter' Mon Nov 17 10:34:27 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 log manifest is 67908e2246c333431dd2ba4398cf1e8c+83
Updated by Anonymous about 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 92 to 100
Applied in changeset arvados|commit:664919d58c3689cd9e0a25547ec1e02d9adda38c.