Bug #3824

[Crunch] crunch-job should create task execution environment inside docker container, not on worker host.

Added by Ward Vandewege about 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 10/10/2014
Due date:
% Done: 100%
Estimated time: (Total: 12.00 h)
Story points: 2.0

Description

Cf. a5819ec1e48fba90658fcf676ffc50c1f216d484 and 0eb5711ade0f74e556b0a1c10909dbf0bdecb63f

Some refactoring of that section of crunch-job may be in order here.

Requirements:
  • JOB_WORK and TASK_WORK must not be the same directory.
  • JOB_WORK and TASK_WORK must exist when a crunch script starts.
  • JOB_WORK and TASK_WORK (and other crunch-specific directories) are not required to exist in the docker image being used. (A design goal is that non-Arvados-aware docker images can be used to run jobs. Expecting special tmp directories won't fly. But we might have to make some reasonable assumptions like "docker image has a world-writable /tmp".)
Probable approach:
  • Run as much stuff as possible inside the docker container. E.g., the "build script" should run inside the container. It should have no trouble creating the required directories.
  • Something like a5819ec1e48fba90658fcf676ffc50c1f216d484 but fix whatever failed to propagate log messages, and make sure to create CRUNCH_SRC in the container
    • docker run -a stdin -i seems to be necessary to get stdin through to the container. See #4044 notes.
    • We probably want to use docker run --sig-proxy too.
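
As a rough illustration of the first two requirements, a preflight check run just before a crunch script starts might look like the sketch below. It is written in portable shell rather than crunch-job's Perl. The function name check_work_dirs is hypothetical, and the writability test is an added assumption borrowed from the subtask about JOB_TMP and TASK_TMP, not something the requirements list itself states.

```shell
# Hypothetical preflight check (not part of crunch-job): verify that the two
# work directories exist, are writable, and are not the same directory.
check_work_dirs() {
  job_work=$1
  task_work=$2
  for dir in "$job_work" "$task_work"; do
    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
      echo "not a writable directory: $dir" >&2
      return 1
    fi
  done
  # -ef is true when both paths resolve to the same device and inode, so it
  # also catches symlinks and differently spelled paths to one directory.
  if [ "$job_work" -ef "$task_work" ]; then
    echo "JOB_WORK and TASK_WORK must be different directories" >&2
    return 1
  fi
  return 0
}

# Example use inside the container, before starting the crunch script:
#   check_work_dirs "$JOB_WORK" "$TASK_WORK" || exit 1
```

Running a check like this inside the container, rather than on the worker host, is what makes it meaningful: it tests the directories the crunch script will actually see.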

Subtasks

Task #4549: Review 3824-task-work | Resolved | Peter Amstutz

Task #4283: Make test case | Resolved | Tom Clegg

Task #4282: Create/setup task temp dirs at runtime | Resolved | Tom Clegg

Task #4548: Make sure JOB_TMP and TASK_TMP are writable by crunch user | Resolved | Tom Clegg

Task #4325: Review 3824-crunch-container-setup | Resolved | Tom Clegg

Task #4153: Talk to Ward | Resolved | Tom Clegg

Task #4550: Review 3824-docker-fixes (lower priority than 3824-task-work) | Resolved | Ward Vandewege


Related issues

Blocks Arvados - Story #3826: [Crunch] Display network activity in crunchstat | Resolved | 10/10/2014

Blocks Arvados - Bug #4470: [Crunch] tmpdir not writable running jobs in Docker | Resolved | 11/07/2014

Associated revisions

Revision c6a03a7a
Added by Tom Clegg almost 5 years ago

Merge branch '3824-crunch-container-setup' closes #3824

Revision 664919d5
Added by Tom Clegg almost 5 years ago

Merge branch '3824-task-work' closes #3824

Revision 5c375252 (diff)
Added by Ward Vandewege almost 5 years ago

A few wording tweaks.

refs #3824

Revision 48ccafc4
Added by Tom Clegg almost 5 years ago

Merge branch '3824-docker-fixes' refs #3824 refs #4186

Revision c33f4e36 (diff)
Added by Tom Clegg almost 5 years ago

Fix syntax error in whitespace. refs #3824

Revision 4143b481 (diff)
Added by Tom Clegg almost 5 years ago

Fix wrong variable assigned. refs #3824

History

#1 Updated by Ward Vandewege about 5 years ago

  • Subject changed from Restore job temp directory to crunch job. to [Crunch] Restore job temp directory to crunch job.

#2 Updated by Ward Vandewege about 5 years ago

  • Subject changed from [Crunch] Restore job temp directory to crunch job. to [Crunch] Restore job temp directory to crunch-job.

#3 Updated by Ward Vandewege about 5 years ago

  • Description updated (diff)

#4 Updated by Tom Clegg about 5 years ago

  • Story points set to 2.0

#5 Updated by Tom Clegg about 5 years ago

  • Description updated (diff)
  • Category set to Crunch

#6 Updated by Tom Clegg almost 5 years ago

  • Subject changed from [Crunch] Restore job temp directory to crunch-job. to [Crunch] Create task execution environment inside docker container, not on worker host.

#7 Updated by Tom Clegg almost 5 years ago

  • Subject changed from [Crunch] Create task execution environment inside docker container, not on worker host. to [Crunch] crunch-job should create task execution environment and run crunchstat inside docker container, not on worker host.

#8 Updated by Tom Clegg almost 5 years ago

  • Description updated (diff)

#9 Updated by Ward Vandewege almost 5 years ago

  • Target version changed from Arvados Future Sprints to 2014-10-29 sprint

#10 Updated by Tom Clegg almost 5 years ago

  • Subject changed from [Crunch] crunch-job should create task execution environment and run crunchstat inside docker container, not on worker host. to [Crunch] crunch-job should create task execution environment and collect stats inside docker container, not on worker host.

#11 Updated by Tom Clegg almost 5 years ago

  • Assigned To set to Tom Clegg

#12 Updated by Tom Clegg almost 5 years ago

  • Subject changed from [Crunch] crunch-job should create task execution environment and collect stats inside docker container, not on worker host. to [Crunch] crunch-job should create task execution environment inside docker container, not on worker host.

#13 Updated by Tom Clegg almost 5 years ago

  • Status changed from New to In Progress

#14 Updated by Tom Clegg almost 5 years ago

3824-crunch-container-setup @ 30bdc9b

  • JOB_WORK and TASK_WORK are docker data volumes, instead of bind mounts to a temp dir on the host.

While testing this I made some other fixes to docker and crunch-job:

  • Bigger tmpfs for keep volumes (big enough to store the arvados/jobs image)
  • In the compute image, install docker.io from the arvados package repository (otherwise "docker.io not found")
  • Run dnsmasq in the compute container (otherwise it can't run containerized jobs: crunch-job assumes the compute worker is a DNS cache)
  • Log() isn't right in a crunch-job child process: just die().
  • If the install script fails during task setup, show its stderr messages.
  • On Workbench jobs#show page, do not print a confusing message ("there are jobs ahead of this one") if queue_position is nil.

#15 Updated by Brett Smith almost 5 years ago

I'm working on building all the Docker infrastructure necessary to test this, but while I do, I wanted to provide my comments on a first pass of the diff at 30bdc9b.

  • If you want to elucidate the comment about TASK_KEEPMOUNT: you may want to double-check me against the git log, but IIRC, the reason we use /keep is that when you specify a volume to put in the container, Docker will effectively mkdir but not mkdir -p the destination. So /keep was a destination that was likely to work and unlikely to conflict with other tools. The default destination wasn't usable because it's under TASK_WORK, which we didn't install inside the container (instead we've been setting a different TASK_WORK).
    I'm not sure I understand your comment about tasks ignoring the value of TASK_KEEPMOUNT; the diff shows the previous code setting TASK_KEEPMOUNT=/keep.
  • I know this made token stripping easier, but I'm a little sad that the srun debugging lost the quoting around multi-word arguments. It wasn't perfect, but I think it was much less likely to be ambiguous compared to what's in the branch.
  • In shell_or_die, I'm not sure you can count on $! to stay useful as long as you do. I tested a little bit on my desktop:
    if (system("false") != 0) {
        print "first system: $!\n";
        my $exitstatus = sprintf("exit %d signal %d", $? >> 8, $? & 0x7f);
        print "set exitstatus: $!\n";
        open(BCSERR, ">/tmp/perlerr.log") or die;
        print "open: $!\n";
        system("true");
        print "second system: $!\n";
        die "$exitstatus";
    }
    

    For me, $! starts empty, then becomes "Inappropriate ioctl for device" after the open call. I think if you want this information, you should probably stash it in $exitstatus.
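
The same hazard exists in shell, where $? is overwritten by every command, so the usual pattern is to copy the status into a variable at the point of failure. The sketch below is a shell analogue of that idea, not the Perl used in crunch-job. The name run_and_report is hypothetical, and treating a status above 128 as "killed by a signal" is a shell convention that cannot tell a real signal apart from a program that deliberately exits with such a code.

```shell
# Hypothetical helper: run a command, capture its status at once, and
# describe how it ended.
run_and_report() {
  rc=0
  "$@" || rc=$?   # stash the status here; any later command would replace $?
  if [ "$rc" -gt 128 ]; then
    # By shell convention, 128+N usually means "terminated by signal N".
    echo "signal $((rc - 128))"
  else
    echo "exit $rc"
  fi
  return "$rc"
}

# Example:
#   run_and_report false    prints "exit 1" and returns 1
```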

#16 Updated by Brett Smith almost 5 years ago

Brett Smith wrote:

I'm working on building all the Docker infrastructure necessary to test this, but while I do, I wanted to provide my comments on a first pass of the diff at 30bdc9b.

If I'm going to test this, I might need a hint on how to get into the Docker universe. I have it all built and started with arvdock, and it looks like things are running normally. But the only way I know to get an API token is to log in through Workbench. When I click the log in button, I land on a page at https://172.17.0.117/auth/joshid/callback?return_to=http%3A%2F%2Flocalhost%3A9899%2Fusers%2Fwelcomex%x%…, and it's a 500. It looks like #4296, except starting the login process over again doesn't help; I always get a 500.

Any hints? What are you doing on your setup between arvdock and submitting a job to the cluster?

#17 Updated by Tom Clegg almost 5 years ago

Brett Smith wrote:

I'm working on building all the Docker infrastructure necessary to test this, but while I do, I wanted to provide my comments on a first pass of the diff at 30bdc9b.

  • If you want to elucidate the comment about TASK_KEEPMOUNT: you may want to double-check me against the git log, but IIRC, the reason we use /keep is that when you specify a volume to put in the container, Docker will effectively mkdir but not mkdir -p the destination. So /keep was a destination that was likely to work and unlikely to conflict with other tools. The default destination wasn't usable because it's under TASK_WORK, which we didn't install inside the container (instead we've been setting a different TASK_WORK).
    I'm not sure I understand your comment about tasks ignoring the value of TASK_KEEPMOUNT; the diff shows the previous code setting TASK_KEEPMOUNT=/keep.

Ah. "Always mount it in /keep" seemed like an attempt to make it predictable for the benefit of someone who didn't want to bother looking at $TASK_KEEPMOUNT. I didn't want to get into the question of who has started hard-coding /keep instead of using $TASK_KEEPMOUNT now that it looks sort of predictable, even if that's not why it originally went that way. (I see run-command defaults to /keep if TASK_KEEPMOUNT is not set.)

I tried this (with Docker version 1.2.0-dev, build dc243c8) with success, so the mkdir -p issue, if any, seems fixed now:

# docker.io run -it --volume=/var/tmp:/foo/bar/baz/waz:ro ubuntu ls -la /foo/bar/baz/waz/
total 0
drwxrwxrwt 2 root root  6 Oct 28 05:17 .
drwxr-xr-x 3 root root 16 Oct 28 05:22 ..

It looks like the "use something outside /tmp/crunch-job" trend started (with /mnt) in 0eb77fba3f7de714a7edef1c57491f3c285f6d67, whose comment attributes it to uid mapping problems.

That makes sense for the *TMP directories, which have to be writable (although I think I've solved that simply by using data volumes instead of host mounts). I'm not sure what the uid mapping problem is for the keep mount, since it's read-only and we use --allow-other and have user_allow_other in /etc/fuse.conf ...

I notice one read-only mount inside another is not so easy:

docker.io run -it --volume=/var/tmp:/foo/bar/baz/waz:ro --volume=/var:/foo/bar:ro ...
setup mount namespace creating new bind mount target mkdir /var/lib/docker/aufs/mnt/f24df0c3[...]/foo/bar/baz: [...]: read-only file system

But overlapping writable mounts seem ok.

  • I know this made token stripping easier, but I'm a little sad that the srun debugging lost the quoting around multi-word arguments. It wasn't perfect, but I think it was much less likely to be ambiguous compared to what's in the branch.

Good point. I'd like to get better logging here because we're currently relying on crunchstat, of all things, to spit out the portion of the "start task" command that comes after crunchstat. I didn't like the way the existing quoting assumed (incorrectly) that there were no single quotes in any $args. New version:

  my $show_cmd = join(" ", map {
    if (/[\s\"]/) {
      s/[\"\$\\]/\\$&/g;
      "\"$_\"";
    } else {
      $_;
    }} @{$args});

Theoretically \Q\E would be correct, but that makes it even less human-readable, and it's already bad enough. (By the time we get here, we've already done a bunch of \Q\E quoting, and it escapes stuff like [-:=/] ...)
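
For comparison, the same display-only quoting rule can be sketched in shell: wrap an argument in double quotes only when it contains whitespace or a double quote, and backslash-escape ", $ and \ inside it. This is an illustration of the rule, not the crunch-job code. The name show_cmd is hypothetical, and the sketch builds its output from copies, so the argument list itself is never modified.

```shell
# Hypothetical helper: print a human-readable rendering of an argument list.
show_cmd() {
  out=
  for arg in "$@"; do
    case $arg in
      *[[:space:]]*|*\"*)
        # Escape ", $ and \ with a backslash, then wrap in double quotes.
        esc=$(printf '%s' "$arg" | sed 's/["$\\]/\\&/g')
        arg="\"$esc\""
        ;;
    esac
    # Reassigning the loop variable changes only this copy, not "$@".
    out="${out:+$out }$arg"
  done
  printf '%s\n' "$out"
}

# Example:
#   show_cmd srun -n1 bash -c "echo hello"
#   prints: srun -n1 bash -c "echo hello"
```

As with the Perl version, the result is meant for log readers, not for pasting back into a shell: an empty argument, for instance, is rendered as nothing at all.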

  • In shell_or_die, I'm not sure you can count on $! to stay useful as long as you do. I tested a little bit on my desktop:
    [...]
    For me, $! starts empty, then becomes "Inappropriate ioctl for device" after the open call. I think if you want this information, you should probably stash it in $exitstatus.

Oops. I looked at man perlfunc and it seems $! is not interesting after system() anyway: "you can check all possible failure modes by inspecting $? like this:" ... so I removed that part of the dying breath.

Logging updates @ fb1bf9f

#18 Updated by Tom Clegg almost 5 years ago

Brett Smith wrote:

When I click the log in button, I land on a page at https://172.17.0.117/auth/joshid/callback?return_to=http%3A%2F%2Flocalhost%3A9899%2Fusers%2Fwelcomex%x%…, and it's a 500. It looks like #4296, except starting the login process over again doesn't help; I always get a 500.

Hm, I haven't hit that one. The main hurdle I've had is that I'm not running docker + browser on the same box, so 172.17.* don't work. After clicking "log in" I have to change the apiserver login url in my browser from https://172.17.*/login to the 192.168.*:9900 address that gets port-forwarded, and it works from there.

Another way: arvdock helpfully clobbers your existing ~/.config/arvados/settings.conf with root credentials and host settings for your docker api server, so "arv user current" should say you're root...

#19 Updated by Brett Smith almost 5 years ago

Tom Clegg wrote:

Brett Smith wrote:

  • If you want to elucidate the comment about TASK_KEEPMOUNT: you may want to double-check me against the git log, but IIRC, the reason we use /keep is that when you specify a volume to put in the container, Docker will effectively mkdir but not mkdir -p the destination.

Ah. "Always mount it in /keep" seemed like an attempt to make it predictable for the benefit of someone who didn't want to bother looking at $TASK_KEEPMOUNT.

I realize it has that effect, but I never understood that as the motivation for any of the related changes that led us up to this point.

It looks like the "use something outside /tmp/crunch-job" trend started (with /mnt) in 0eb77fba3f7de714a7edef1c57491f3c285f6d67, whose comment attributes it to uid mapping problems.

That makes sense for the *TMP directories, which have to be writable (although I think I've solved that simply by using data volumes instead of host mounts). I'm not sure what the uid mapping problem is for the keep mount, since it's read-only and we use --allow-other and have user_allow_other in /etc/fuse.conf ...

I think you already know this, but just to make sure we're on the same page, note that --allow-other doesn't override POSIX permissions. It's the other way around: if you don't allow others to read the mount, then they'll be forbidden no matter what the POSIX permissions say.

But that doesn't detract from your main point that UID mapping should never have been an issue for the FUSE mount, where everything's world-readable. Since the same commit changes both the temp directories and the FUSE mount, it looks to me like UID mapping was the motivation for the tempdir changes, and then moving the Keep mount was sort of a side effect of that.

I agree that the volume approach should make the UID mapping issue moot.

  • I know this made token stripping easier, but I'm a little sad that the srun debugging lost the quoting around multi-word arguments. It wasn't perfect, but I think it was much less likely to be ambiguous compared to what's in the branch.

Good point. I'd like to get better logging here because we're currently relying on crunchstat, of all things, to spit out the portion of the "start task" command that comes after crunchstat. I didn't like the way the existing quoting assumed (incorrectly) that there were no single quotes in any $args. New version:

Would it make sense to just use Data::Dumper here? It seems like that's what we really want.

I looked at man perlfunc and it seems $! is not interesting after system() anyway: "you can check all possible failure modes by inspecting $? like this:" ... so I removed that part of the dying breath.

$? may report everything, but FWIW, $! does seem to have a useful string in failure modes beyond "child exited nonzero." For example, it says "No such file or directory" if you try to run something that doesn't exist. I'm fine with the current version, just sharing what I saw in my testing.

#20 Updated by Brett Smith almost 5 years ago

I got the hash job to run correctly inside a Docker container inside my Docker cluster. That's pretty awesome. Thanks for helping me through it. A couple of things this shook out from the branch:

  • As discussed on IRC, the logging in srun can't mutate the strings in $args, lest it interfere with shell quoting or other semantics.
  • There's no need to add our apt source in compute/Dockerfile, because that's already taken care of by arvados/base, which we inherit from.

And then some longer-standing issues in the Dockerfiles. I'm okay with seeing these addressed in the branch, or separately:

  • Running apt-get update more than once per Dockerfile is redundant. I suggest trimming extraneous calls at least in the touched Dockerfiles.
  • We should set permissions on /etc/fuse.conf when we add it to the compute image, because it needs to be world-readable and the builder's umask might not allow that.

Thanks.

#21 Updated by Tom Clegg almost 5 years ago

  • Target version changed from 2014-10-29 sprint to 2014-11-19 sprint

#22 Updated by Tom Clegg almost 5 years ago

Using Data::Dumper, the log message now looks like this:

Thu Oct 30 17:07:28 2014 2y486-8i9sb-i7pbmv8h0xd8efc 20198 1 stderr starting: ['srun','--nodelist=compute1','-n1','-c1','-N1','-D','/tmp','--job-name=2y486-8i9sb-i7pbmv8h0xd8efc.1.20477','bash','-c','if [ -e /tmp/crunch-job/task/compute1.1 ]; then rm -rf /tmp/crunch-job/task/compute1.1; fi; mkdir -p /tmp/crunch-job /tmp/crunch-job/work /tmp/crunch-job/task/compute1.1 /tmp/crunch-job/task/compute1.1.keep && cd /tmp/crunch-job && perl -&& exec arv-mount --by-id --allow-other /tmp/crunch-job/task/compute1.1.keep --exec crunchstat -cgroup-root=/sys/fs/cgroup -cgroup-parent=docker -cgroup-cid=/tmp/crunch-job/2y486-ot0gb-j601wipo850mzm3.cid -poll=10000 /usr/bin/docker.io run --rm=true --attach=stdout --attach=stderr --attach=stdin -i --user=crunch --cidfile=/tmp/crunch-job/2y486-ot0gb-j601wipo850mzm3.cid --sig-proxy $(ip -o address show scope global | gawk \'match($4, /^([0-9\\.:]+)\\//, x){print "--dns", x[1]}\') --volume=\\/tmp\\/crunch\\-job\\/src\\:\\/tmp\\/crunch\\-job\\/src\\:ro --volume=\\/tmp\\/crunch\\-job\\/opt\\:\\/tmp\\/crunch\\-job\\/opt\\:ro --volume=\\/tmp\\/crunch\\-job\\/task\\/compute1\\.1\\.keep\\:\\/keep\\:ro --volume=\\/tmp\\/crunch\\-job\\/task\\/compute1\\.1 --volume=\\/tmp\\/crunch\\-job\\/work --env=CRUNCH_JOB_BIN\\=\\/usr\\/src\\/arvados\\/services\\/crunch\\/crunch\\-job --env=TASK_SEQUENCE\\=1 --env=TASK_KEEPMOUNT\\=\\/keep --env=CRUNCH_SRC_COMMIT\\=e2fe6c0e5c1c62a37e03519590c04a5186a2cc9b --env=TASK_QSEQUENCE\\=1 --env=CRUNCH_INSTALL\\=\\/tmp\\/crunch\\-job\\/opt --env=CRUNCH_REFRESH_TRIGGER\\=\\/tmp\\/crunch_refresh_trigger --env=ARVADOS_API_TOKEN\\=[...] 
--env=CRUNCH_WORK\\=\\/tmp\\/crunch\\-job\\/work --env=CRUNCH_TMP\\=\\/tmp\\/crunch\\-job --env=TASK_TMPDIR\\=\\/tmp\\/crunch\\-job\\/task\\/compute1\\.1 --env=JOB_UUID\\=2y486\\-8i9sb\\-i7pbmv8h0xd8efc --env=CRUNCH_JOB_UUID\\=2y486\\-8i9sb\\-i7pbmv8h0xd8efc --env=TASK_SLOT_NUMBER\\=1 --env=CRUNCH_SRC_URL\\=\\/var\\/lib\\/arvados\\/internal\\.git --env=TASK_SLOT_NODE\\=compute1 --env=JOB_PARAMETER_INPUT\\=83367e8913dcec0bf3fc25ed5a27eacb\\+49 --env=ARVADOS_API_HOST_INSECURE\\=yes --env=JOB_SCRIPT\\=hash --env=CRUNCH_NODE_SLOTS\\=1 --env=TASK_WORK\\=\\/tmp\\/crunch\\-job\\/task\\/compute1\\.1 --env=ARVADOS_API_HOST\\=api --env=JOB_WORK\\=\\/tmp\\/crunch\\-job\\/work --env=TASK_UUID\\=2y486\\-ot0gb\\-j601wipo850mzm3 --env=CRUNCH_SRC\\=\\/tmp\\/crunch\\-job\\/src --env=HOME\\=\\/tmp\\/crunch\\-job\\/task\\/compute1\\.1 67efcf13b3cc5e240cabf938e7d7c72fb033425d2e9b4b1d65e3a1e96e6bdaad stdbuf --output=0 --error=0 /tmp/crunch-job/src/crunch_scripts/hash']

Removed the redundant "add arvados repo" and just added docker.io to the existing apt-get install command instead.

Brought back $! (safely, via my $err).

Now at da6674e

#23 Updated by Tom Clegg almost 5 years ago

Brett Smith wrote:

And then some longer-standing issues in the Dockerfiles. I'm okay with seeing these addressed in the branch, or separately:

  • Running apt-get update more than once per Dockerfile is redundant. I suggest trimming extraneous calls at least in the touched Dockerfiles.

Hm, they're redundant if you build all of the images at once, but I'm not confident we should assume that. Would "apt-get update unless /var/cache/apt/pkgcache.bin is <1 hour old" be better? (Trying to choose an ideal threshold mostly makes me inclined to err on the side of caution and leave apt-get update in there for the sake of predictability.)
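
For the sake of discussion, the "update unless the cache is recent" idea could be sketched as below. This is only an illustration of the proposal, not something in the branch. The name cache_is_stale is hypothetical, the 60-minute default mirrors the one-hour figure above, and find's -mmin test is a widely available extension rather than strict POSIX.

```shell
# Hypothetical helper: succeed (return 0) when the file is missing or was
# last modified more than max_age_min minutes ago.
cache_is_stale() {
  cache=$1
  max_age_min=${2:-60}
  [ -e "$cache" ] || return 0
  # find prints the path only if it is older than the threshold.
  [ -n "$(find "$cache" -mmin +"$max_age_min" 2>/dev/null)" ]
}

# Example use in an image build step:
#   if cache_is_stale /var/cache/apt/pkgcache.bin 60; then
#     apt-get update
#   fi
```

The threshold is the awkward part: whatever value is chosen, builds on either side of it behave differently, which is the predictability concern raised above.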

Added #4366 so we can deliberate further. :)

  • We should set permissions on /etc/fuse.conf when we add it to the compute image, because it needs to be world-readable and the builder's umask might not allow that.

Done in 1bcfc05

#24 Updated by Tom Clegg almost 5 years ago

Tom Clegg wrote:

Brett Smith wrote:

And then some longer-standing issues in the Dockerfiles. I'm okay with seeing these addressed in the branch, or separately:

  • Running apt-get update more than once per Dockerfile is redundant. I suggest trimming extraneous calls at least in the touched Dockerfiles.

Hm, they're redundant if you build all of the images at once, but I'm not confident we should assume that. Would "apt-get update unless /var/cache/apt/pkgcache.bin is <1 hour old" be better? (Trying to choose an ideal threshold mostly makes me inclined to err on the side of caution and leave apt-get update in there for the sake of predictability.)

Added #4366 so we can deliberate further. :)

Sorry, I misread your comment. Removed redundant "apt-get update"s from Dockerfiles that had more than one (base and compute). Also went through (most of) the Dockerfiles to make them use "apt-get update -qq" and "apt-get install -qqy" for consistency and greppability. (Greppibility?)

Removed #4366.

Also, added GPG key to base image so latest RVM can build -- see https://github.com/wayneeseguin/rvm/commit/7386864bc5e47de1d5cbbd339f9e008d0811c181

Now at c199c0c

#25 Updated by Brett Smith almost 5 years ago

Reviewing c199c0c, and my only concern is the TASK_KEEPMOUNT comment. Looking at it again, I feel more strongly about it: crunch-job has always set TASK_KEEPMOUNT correctly, and Crunch authors have consistently been told to use it. Any scripts that do otherwise are buggy, and I worry that the comment might give future readers the idea that this is a corner case crunch-job must support in perpetuity. Everything else looks good to me.

#26 Updated by Tom Clegg almost 5 years ago

Brett Smith wrote:

Reviewing c199c0c, and my only concern is the TASK_KEEPMOUNT comment. Looking at it again, I feel more strongly about it: crunch-job has always set TASK_KEEPMOUNT correctly, and Crunch authors have consistently been told to use it. Any scripts that do otherwise are buggy, and I worry that the comment might give future readers the idea that this is a corner case crunch-job must support in perpetuity. Everything else looks good to me.

Indeed, when I wrote that comment I meant to mock the idea of making the path predictable, but I see now that the sarcasm is unhelpful. I've fixed the comment so it says clearly (I think!) to do the right thing. b39e2b4

+      # Currently, we make arv-mount's mount point appear at /keep
+      # inside the container (instead of using the same path as the
+      # host like we do with CRUNCH_SRC and CRUNCH_INSTALL). However,
+      # crunch scripts and utilities must not rely on this. They must
+      # use $TASK_KEEPMOUNT.
       $command .= "--volume=\Q$ENV{TASK_KEEPMOUNT}:/keep:ro\E ";
       $ENV{TASK_KEEPMOUNT} = "/keep";
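
From the crunch script's side, following that comment means building Keep paths from $TASK_KEEPMOUNT instead of a literal /keep. A minimal sketch, assuming a shell-based script: the helper name keep_path and the file name in the example are hypothetical, and refusing to run when the variable is unset is a choice made for this sketch (as noted earlier, run-command falls back to /keep instead).

```shell
# Hypothetical helper: build a path under the Keep mount from the
# environment, never from a hard-coded /keep.
keep_path() {
  # Abort with a message if crunch-job did not provide the mount point.
  : "${TASK_KEEPMOUNT:?TASK_KEEPMOUNT is not set}"
  # Drop one trailing slash, if present, so the result has no "//".
  printf '%s/%s\n' "${TASK_KEEPMOUNT%/}" "$1"
}

# Example, assuming arv-mount --by-id exposes collections by locator:
#   input=$(keep_path "$JOB_PARAMETER_INPUT/input.txt")
```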

#27 Updated by Anonymous almost 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 78 to 100

Applied in changeset arvados|commit:c6a03a7abff947dc8242e8be18b4b5e6920a3e4a.

#28 Updated by Tom Clegg almost 5 years ago

  • Status changed from Resolved to In Progress

#29 Updated by Peter Amstutz almost 5 years ago

Reviewing 1137425

I still can't run a job inside Docker:

Mon Nov 17 10:34:22 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  check slurm allocation
Mon Nov 17 10:34:22 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  node localhost - 4 slots
Mon Nov 17 10:34:22 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  start
Mon Nov 17 10:34:22 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  Clean work dirs
Mon Nov 17 10:34:22 2014 starting: ['bash','-c','if mount | grep -q $JOB_WORK/; then for i in $JOB_WORK/*keep $CRUNCH_TMP/task/*.keep; do /bin/fusermount -z -u $i; done; fi; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src*']
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  Cleanup command exited 0
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  Looking for version 3a31350c6265cb1135d3d4d40af436aae91a9894 from repository arvados
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  Using local repository '/home/peter/work/_arvados_internal.git'
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  Version 3a31350c6265cb1135d3d4d40af436aae91a9894 is commit 3a31350c6265cb1135d3d4d40af436aae91a9894
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  Run install script on all workers
Mon Nov 17 10:34:25 2014 starting: ['sh','-c','mkdir -p /tmp/crunch-job-1001/opt && cd /tmp/crunch-job-1001 && perl -']
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  Install script exited 0
Mon Nov 17 10:34:25 2014 starting: ['/bin/sh','-ec',' if ! /usr/bin/docker.io images -q --no-trunc | grep -qxF 777ef687a8811f22fdd7c615be9356a92ce5f2150ff481bd368def31eae1bc15; then     arv-get c04222e796767ae26d1096c7717162d6\\+1134\\/777ef687a8811f22fdd7c615be9356a92ce5f2150ff481bd368def31eae1bc15\\.tar | /usr/bin/docker.io load fi ']
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  script run-command
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  script_version 3a31350c6265cb1135d3d4d40af436aae91a9894
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  script_parameters {"task.stdout":"foo.txt","command":[["ls"]]}
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  runtime_constraints {"max_tasks_per_node":0,"docker_image":"arvados/jobs"}
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  start level 0
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  status: 0 done, 0 running, 1 todo
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 job_task 4n8aq-ot0gb-k613z8mfcng66kl
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 child 21511 started on localhost.1
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr starting: ['bash','-c','if [ -e /tmp/crunch-job-1001/task/localhost.1 ]; then rm -rf /tmp/crunch-job-1001/task/localhost.1; fi; mkdir -p /tmp/crunch-job-1001 /tmp/crunch-job-1001/work /tmp/crunch-job-1001/task/localhost.1 /tmp/crunch-job-1001/task/localhost.1.keep && cd /tmp/crunch-job-1001 && exec arv-mount --by-id --allow-other /tmp/crunch-job-1001/task/localhost.1.keep --exec crunchstat -cgroup-root=/sys/fs/cgroup -cgroup-parent=docker -cgroup-cid=/tmp/crunch-job-1001/4n8aq-ot0gb-k613z8mfcng66kl.cid -poll=10000 /usr/bin/docker.io run --rm=true --attach=stdout --attach=stderr --attach=stdin -i --user=crunch --cidfile=/tmp/crunch-job-1001/4n8aq-ot0gb-k613z8mfcng66kl.cid --sig-proxy $(ip -o address show scope global |               gawk \'match($4, /^([0-9\\.:]+)\\//, x){print "--dns", x[1]}\') --volume=\\/tmp\\/crunch\\-job\\-1001\\/src\\:\\/tmp\\/crunch\\-job\\-1001\\/src\\:ro --volume=\\/tmp\\/crunch\\-job\\-1001\\/opt\\:\\/tmp\\/crunch\\-job\\-1001\\/opt\\:ro --volume=\\/tmp\\/crunch\\-job\\-1001\\/task\\/localhost\\.1\\.keep\\:\\/keep\\:ro --volume=/tmp --env=CRUNCH_TMP\\=\\/tmp\\/crunch\\-job\\-1001 --env=ARVADOS_API_HOST_INSECURE\\=true --env=TASK_TMPDIR\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/localhost\\.1 --env=TASK_QSEQUENCE\\=0 --env=CRUNCH_SRC_COMMIT\\=3a31350c6265cb1135d3d4d40af436aae91a9894 --env=CRUNCH_JOB_BIN\\=\\/home\\/peter\\/work\\/arvados\\/services\\/crunch\\/crunch\\-job --env=CRUNCH_NODE_SLOTS\\=4 --env=ARVADOS_API_TOKEN\\=[...] 
--env=JOB_SCRIPT\\=run\\-command --env=TASK_KEEPMOUNT\\=\\/keep --env=CRUNCH_INSTALL\\=\\/tmp\\/crunch\\-job\\-1001\\/opt --env=TASK_WORK\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/localhost\\.1 --env=JOB_PARAMETER_COMMAND\\=ARRAY\\(0x1e0b2a8\\) --env=TASK_SEQUENCE\\=0 --env=CRUNCH_SRC\\=\\/tmp\\/crunch\\-job\\-1001\\/src --env=JOB_PARAMETER_TASK\\.STDOUT\\=foo\\.txt --env=CRUNCH_WORK\\=\\/tmp\\/crunch\\-job\\-1001\\/work --env=TASK_SLOT_NUMBER\\=1 --env=CRUNCH_JOB_UUID\\=4n8aq\\-8i9sb\\-bwrsi9zltvbut0t --env=TASK_UUID\\=4n8aq\\-ot0gb\\-k613z8mfcng66kl --env=ARVADOS_API_HOST\\=petere1\\:3001 --env=TASK_SLOT_NODE\\=localhost --env=CRUNCH_REFRESH_TRIGGER\\=\\/tmp\\/crunch_refresh_trigger --env=CRUNCH_SRC_URL\\=\\/home\\/peter\\/work\\/_arvados_internal\\.git --env=JOB_UUID\\=4n8aq\\-8i9sb\\-bwrsi9zltvbut0t --env=JOB_WORK\\=\\/tmp\\/crunch\\-job\\-work --env=HOME\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/localhost\\.1 777ef687a8811f22fdd7c615be9356a92ce5f2150ff481bd368def31eae1bc15 stdbuf --output=0 --error=0 perl - /tmp/crunch-job-1001/src/crunch_scripts/run-command']
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  status: 0 done, 1 running, 0 todo
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: Running [/usr/bin/docker.io run --rm=true --attach=stdout --attach=stderr --attach=stdin -i --user=crunch --cidfile=/tmp/crunch-job-1001/4n8aq-ot0gb-k613z8mfcng66kl.cid --sig-proxy --dns 10.13.4.125 --dns 172.17.42.1 --volume=/tmp/crunch-job-1001/src:/tmp/crunch-job-1001/src:ro --volume=/tmp/crunch-job-1001/opt:/tmp/crunch-job-1001/opt:ro --volume=/tmp/crunch-job-1001/task/localhost.1.keep:/keep:ro --volume=/tmp --env=CRUNCH_TMP=/tmp/crunch-job-1001 --env=ARVADOS_API_HOST_INSECURE=true --env=TASK_TMPDIR=/tmp/crunch-job-task-work/localhost.1 --env=TASK_QSEQUENCE=0 --env=CRUNCH_SRC_COMMIT=3a31350c6265cb1135d3d4d40af436aae91a9894 --env=CRUNCH_JOB_BIN=/home/peter/work/arvados/services/crunch/crunch-job --env=CRUNCH_NODE_SLOTS=4 --env=ARVADOS_API_TOKEN=4pf5q524ay4l7a47269nnml3wjfq4qtiiqqj1dd4nj46s00ane --env=JOB_SCRIPT=run-command --env=TASK_KEEPMOUNT=/keep --env=CRUNCH_INSTALL=/tmp/crunch-job-1001/opt --env=TASK_WORK=/tmp/crunch-job-task-work/localhost.1 --env=JOB_PARAMETER_COMMAND=ARRAY(0x1e0b2a8) --env=TASK_SEQUENCE=0 --env=CRUNCH_SRC=/tmp/crunch-job-1001/src --env=JOB_PARAMETER_TASK.STDOUT=foo.txt --env=CRUNCH_WORK=/tmp/crunch-job-1001/work --env=TASK_SLOT_NUMBER=1 --env=CRUNCH_JOB_UUID=4n8aq-8i9sb-bwrsi9zltvbut0t --env=TASK_UUID=4n8aq-ot0gb-k613z8mfcng66kl --env=ARVADOS_API_HOST=petere1:3001 --env=TASK_SLOT_NODE=localhost --env=CRUNCH_REFRESH_TRIGGER=/tmp/crunch_refresh_trigger --env=CRUNCH_SRC_URL=/home/peter/work/_arvados_internal.git --env=JOB_UUID=4n8aq-8i9sb-bwrsi9zltvbut0t --env=JOB_WORK=/tmp/crunch-job-work --env=HOME=/tmp/crunch-job-task-work/localhost.1 777ef687a8811f22fdd7c615be9356a92ce5f2150ff481bd368def31eae1bc15 stdbuf --output=0 --error=0 perl - /tmp/crunch-job-1001/src/crunch_scripts/run-command]
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: reading stats from /sys/fs/cgroup/memory/memory.stat
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: mem 8831631360 cache 29884416 swap 11093 pgmajfault 5604982784 rss
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuacct/cpuacct.stat
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuset/cpuset.cpus
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: cpu 305513.6700 user 41036.8100 sys 4 cpus
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: reading stats from /sys/fs/cgroup/blkio/blkio.io_service_bytes
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: blkio:8:32 11837975040 write 3981105152 read
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: blkio:8:16 3230437376 write 1393857536 read
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: blkio:8:0 138842447872 write 5431453696 read
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuacct/cgroup.procs
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: net:docker0 167907635 tx 70811882 rx
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr crunchstat: net:eth1 3221561126 tx 9785928672 rx
Mon Nov 17 10:34:25 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 stderr /tmp/crunch-job-1001/src.lock: Permission denied at - line 26.
Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 child 21511 on localhost.1 exit 13 success=
Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 failure (#1, permanent) after 2 seconds
Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477 0 output
Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  Every node has failed -- giving up on this round
Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  wait for last 0 children to finish
Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  status: 0 done, 0 running, 1 todo
Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  Freeze not implemented
Mon Nov 17 10:34:26 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  collate
Mon Nov 17 10:34:27 2014 Collection saved as 'Saved at 2014-11-17 15:34:23 UTC by peter@peter'
Mon Nov 17 10:34:27 2014 4n8aq-8i9sb-bwrsi9zltvbut0t 21477  log manifest is 67908e2246c333431dd2ba4398cf1e8c+83

#30 Updated by Peter Amstutz almost 5 years ago

commit:5058834 3824-task-work LGTM

#31 Updated by Anonymous almost 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 92 to 100

Applied in changeset arvados|commit:664919d58c3689cd9e0a25547ec1e02d9adda38c.
