Story #7582

[CWL] binary run-command shim for CWL

Added by Peter Amstutz almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 10/16/2015
Due date:
% Done: 0%
Estimated time: (Total: 0.00 h)
Story points: 2.0

Description

CWL jobs are allowed to specify arbitrary Docker containers. In particular, this means that depending on our existing run-command script (which depends on the Arvados Python SDK, which itself has a long chain of further dependencies) is a non-starter, because we can't assume a friendly environment inside the container.

Implement a simplified run-command (crunch-runner?) in Go, compiled to a binary with minimal external dependencies so it can be loaded into arbitrary containers. It will act as a shim that runs the command line generated by CWL and communicates the results to Arvados.

Required features:

  • Run arbitrary program in a subprocess
  • Set environment variables for subprocess
  • Redirect stdin/stdout
  • Optionally symlink input files to output directory
  • Determine success/failure based on exit code
  • Upload contents of output directory to collection
  • Communicate task success/failure to API server along with output collection.

Subtasks

Task #7638: Review 7582-crunch-runner (Resolved, Peter Amstutz)

Task #7588: Implement crunch-runner in Go (Resolved, Peter Amstutz)

Task #7606: Review 7582-run-any-docker-container (Resolved, Tom Clegg)


Related issues

Related to Arvados - Story #9397: [Crunch2] Support prepopulating the output directory - CWL InitialWorkDirRequirement (Resolved, 01/31/2017)

Associated revisions

Revision dbf51c6a
Added by Peter Amstutz almost 4 years ago

Merge branch '7582-run-any-docker-container' refs #7582

Revision f4bc8637
Added by Peter Amstutz almost 4 years ago

Merge branch '7582-crunch-runner' refs #7582

Revision 0975ea0b (diff)
Added by Peter Amstutz almost 4 years ago

Update Gemfile pin on arvados-cli to ensure latest crunch-job refs #7582

Revision 77460b21 (diff)
Added by Peter Amstutz almost 4 years ago

Fix typo in arvados-cli version number refs #7582

Revision 0477296d (diff)
Added by Tom Clegg almost 4 years ago

Sync Gemfile.lock to current Gemfile.

amends 77460b2190e84df4178c25f014bbf136d559922e
refs #7582

History

#1 Updated by Peter Amstutz almost 4 years ago

  • Tracker changed from Bug to Story

#2 Updated by Peter Amstutz almost 4 years ago

  • Subject changed from [CWL] binary run-command shim to [CWL] binary run-command shim for CWL

#3 Updated by Peter Amstutz almost 4 years ago

  • Description updated (diff)

#4 Updated by Peter Amstutz almost 4 years ago

  • Story points set to 2.0

#5 Updated by Peter Amstutz almost 4 years ago

  • Description updated (diff)

#6 Updated by Peter Amstutz almost 4 years ago

  • Story points changed from 2.0 to 3.0

#7 Updated by Brett Smith almost 4 years ago

  • Target version set to 2015-10-28 sprint

#8 Updated by Brett Smith almost 4 years ago

  • Assigned To set to Peter Amstutz

#9 Updated by Peter Amstutz almost 4 years ago

  • Status changed from New to In Progress
  • Story points changed from 3.0 to 2.0

#10 Updated by Peter Amstutz almost 4 years ago

Turns out this does require a few tweaks to crunch-job, but not major surgery.

Branch 7582-run-any-docker-container is up for review with the following changes:

  • Don't use --user=crunch because (a) there might not be a user called crunch in the container and (b) the crunch user could be aliased to root, so it's a terrible security bug. This fixes it to use the UID of the calling process on the compute node for the user inside the container.
  • Add "no_provision" to runtime constraints; when enabled, it skips the "perl - ..." provisioning and runs the supplied crunch script directly.

#11 Updated by Tom Clegg almost 4 years ago

Peter Amstutz wrote:

This fixes it to use the UID of the calling process on the compute node for the user inside the container.

This means the contained process always runs with whatever UID the "crunch" user has on the worker node, which means either:
  • containers must be built so they can run jobs as any UID, or
  • containers must be built so they can run jobs using a standard UID, say 1000, and all worker nodes must give the "crunch" user that same numeric UID.

In the former case, we could just choose UID 1000 and it would work.

In the latter case, we could just say --user=1000 and not be sensitive to the UID assignments on the worker node.

BTW, I don't see how this change fixes any security holes. Giving passwordless sudo to a user with UID=1000 is not significantly harder than giving the crunch user UID=0...?

Seems like we should just omit the --user argument entirely if we think we're in "leave my container alone" mode. (I suppose a lot of jobs would break if we started treating "no arvados sdk version specified" as "leave my container alone" mode...?)

  • Add "no_provision" to runtime constraints; when enabled, it skips the "perl - ..." provisioning and runs the supplied crunch script directly.

What's the intent behind this change? It seems to be a misnomer, in that it doesn't seem to skip the provisioning, and in fact still relies on the provisioning -- it just skips the wrapper around the task process. Is the idea here to drop the container-must-have-perl dependency?

#12 Updated by Peter Amstutz almost 4 years ago

Tom Clegg wrote:

Peter Amstutz wrote:

This fixes it to use the UID of the calling process on the compute node for the user inside the container.

This means the contained process always runs with whatever UID the "crunch" user has on the worker node, which means either:
  • containers must be built so they can run jobs as any UID, or
  • containers must be built so they can run jobs using a standard UID, say 1000, and all worker nodes must give the "crunch" user that same numeric UID.

In the former case, we could just choose UID 1000 and it would work.

In the latter case, we could just say --user=1000 and not be sensitive to the UID assignments on the worker node.

BTW, I don't see how this change fixes any security holes. Giving passwordless sudo to a user with UID=1000 is not significantly harder than giving the crunch user UID=0...?

Seems like we should just omit the --user argument entirely if we think we're in "leave my container alone" mode. (I suppose a lot of jobs would break if we started treating "no arvados sdk version specified" as "leave my container alone" mode...?)

The working assumption is that "root inside the container" == "root outside the container" (i.e., at least in older versions of Docker, my understanding is that the root user is insufficiently contained, such that it can break out of the cgroup namespacing, either by exploit or by design, and thus become root on the host). Supposedly Docker 1.8 has better privilege separation, but I would want to do more research.

Since we do actually allow strangers on the Internet to run Docker containers on our compute nodes, it seems much better that the security baseline is that the process inside the container isn't a privileged one. This was the intention behind adding --user=crunch in the first place. The fact that --user=crunch doesn't actually protect you because crunch could be an alias for root is a bug, not a feature.

The reason for using the calling process UID specifically (and not some default or random UID) is (a) to avoid UID collisions and (b) so that filesystem UIDs on bind mounts match up with the host process. For example, if we run as random UID 31337 but the bind mounts are owned by UID 1000 with mode 0600, then the user inside the container won't be able to read them.

It would actually be even better if the process UID inside the container was a different unprivileged user which didn't have access to sudo or docker on the host, but that's probably a bit more complicated to rig up.

  • containers must be built so they can run jobs as any UID

I think that is a reasonable expectation; it's what we try to do now, and it seems to work in practice.

  • Add "no_provision" to runtime constraints, when enabled it skips the "perl - ..." provisioning and runs the supplied crunch script directly.

What's the intent behind this change? It seems to be a misnomer, in that it doesn't seem to skip the provisioning, and in fact still relies on the provisioning -- it just skips the wrapper around the task process.

By "provisioning" I just meant the "install Python SDK" part. So, a poor choice of name.

I would be fine with "no arvados sdk version specified" means "leave my container alone" mode. That's more or less how it used to work before the "pipe to perl to install arvados python sdk" behavior was added.

Is the idea here to drop the container-must-have-perl dependency?

Yes. As it turns out, even if the container does have Perl, one of the module imports in the bootstrap script (I think "use File::Path") fails because it's not available on a stock Ubuntu 12.04 image. But it is much better not to depend on perl at all.

#13 Updated by Tom Clegg almost 4 years ago

Peter Amstutz wrote:

Since we do actually allow strangers on the Internet to run Docker containers on our compute nodes, it seems much better that the security baseline is that the process inside the container isn't a privileged one. This was the intention behind adding --user=crunch in the first place. The fact that --user=crunch doesn't actually protect you because crunch could be an alias for root is a bug, not a feature.

Let's be clear: the "--user=crunch" argument has never been a security device, and a "--user=1000" feature would not be a security device either. The security problem is: root in container can get root on host + container can give root to whoever it wants + user can run arbitrary container = user can get root on host. Changing --user=crunch to --user=1000 eliminates one but leaves other trivial ways of getting root in a container of one's own design, so the security value seems to be zero on its own. Anyway, it's fine not to fix the security problem right here, but we should avoid having illusions.

If we combine this with docker --cap-drop then we might get somewhere... of course it would be ideal if it were safe to allow root in the container, but that might(?) be further off.

The reason for using the calling process UID specifically (and not some default or random UID) is (a) to avoid UID collisions and (b) so that filesystem UIDs on bind mounts match up with the host process. For example, if we run as random UID 31337 but the bind mounts are owned by UID 1000 with mode 0600, then the user inside the container won't be able to read them.

Is there some situation where this 0600 issue happens? It seems to me we might be better off making bind-mounted files appear as UID=0 GID=0 inside the container: that way, container behavior wouldn't be sensitive to how well the container's UIDs correspond to the host's UIDs. (But perhaps this would be Hard.)

It would actually be even better if the process UID inside the container was a different unprivileged user which didn't have access to sudo or docker on the host, but that's probably a bit more complicated to rig up.

No matter what UID you choose, can't a purpose-built container trivially let that "unprivileged" process become root inside the container using setuid, and then escalate from there to root@host?

  • containers must be built so they can run jobs as any UID

I think that is a reasonable expectation; it's what we try to do now, and it seems to work in practice.

Right now we're insensitive to the UID, but we require an account called "crunch". Losing the "account called crunch" part is good, but can we do it without introducing new arbitrary constraints?

The proposed change seems to unnecessarily leak worker node configuration into the container. Say my cluster gives the crunch user UID 1000, yours gives the crunch user UID 1001. I'll build my images with files owned by UID=1000, you'll build yours with files owned by UID=1001, and they will "work" but they won't be portable.

If we're going to require containers to depend on a numeric UID, perhaps we should admit that's what we're doing, pick one ("--user=4005"), and decree that the crunch account on worker nodes must have UID=4005. (Ideally we don't have to rely on this, but if we're embracing the idea that portability depends on UID conventions, we should admit it, and establish those conventions.)

If possible, it seems like we should let the user specify a UID -- and possibly opt out of the --user argument entirely and leave it up to the Dockerfile. Either way it would be under the user's control, and portable between clusters.

(Any sense of how much stuff would break if we just dropped the --user part completely, and relied on the Dockerfile to specify a user?)

Assuming that change is too breaking, is there any reason not to let the job submission specify a username/id, where something like "" means let the Dockerfile decide, and the default (null or omitted) means use "crunch" (or "4005")?

By "provisioning" I just meant the "install Python SDK" part. So, a poor choice of name.

I would be fine with "no arvados sdk version specified" means "leave my container alone" mode. That's more or less how it used to work before the "pipe to perl to install arvados python sdk" behavior was added.

This sounds good, if we can make it work. I'd like to avoid adding another miscellaneous knob to runtime_constraints. "If you don't ask for an arvados_sdk_version, we don't rely on perl being in the container" sounds pretty good.

(Note the "pipe to perl to deposit the arvados python sdk" part doesn't change (and might not have to) because it delivers to the worker host, not the container. It's only the "activate" part that uses perl.)

The "run mode" of the install-and-run script looks tantalizingly close to a no-op if you assume we have a docker image (so there's no remove_tree needed to clean tmp) and there's no arvados_sdk_version to install.

It looks like we might be relying on the perl script to create /tmp/crunch-job-work/ and /tmp/crunch-job-task-work/, though, only because creating them as docker volumes makes them writable only by root. Maybe a docker feature has appeared to solve this? Assigning mode 777 would be enough... Or, I suppose it shouldn't be too hard to write a mkdir && exec script that can be fed to sh instead of perl.

Apart from that, it seems like "using docker image, but not arvados_sdk_version" means we can skip the "perl -" part. Is this correct?

Is the idea here to drop the container-must-have-perl dependency?

Yes. As it turns out, even if the container does have Perl, one of the module imports in the bootstrap script (I think "use File::Path") fails because it's not available on a stock Ubuntu 12.04 image. But it is much better not to depend on perl at all.

OK, this helps.

#14 Updated by Peter Amstutz almost 4 years ago

Tom Clegg wrote:

Peter Amstutz wrote:

Let's be clear: the "--user=crunch" argument has never been a security device, and a "--user=1000" feature would not be a security device either. The security problem is: root in container can get root on host + container can give root to whoever it wants + user can run arbitrary container = user can get root on host. Changing --user=crunch to --user=1000 eliminates one but leaves other trivial ways of getting root in a container of one's own design, so the security value seems to be zero on its own. Anyway, it's fine not to fix the security problem right here, but we should avoid having illusions.

If we combine this with docker --cap-drop then we might get somewhere... of course it would be ideal if it were safe to allow root in the container, but that might(?) be further off.

Here's a good article on Docker security:

https://opensource.com/business/14/9/security-for-docker

So, you're right, I thought Docker disallowed the suid bit on binaries by default, but on further research that's not true. We probably should add --cap-drop setuid --cap-drop setgid (or even see if things work with --cap-drop all) but I agree that is out of scope for this story.

Is there some situation where this 0600 issue happens? It seems to me we might be better off making bind-mounted files appear as UID=0 GID=0 inside the container: that way, container behavior wouldn't be sensitive to how well the container's UIDs correspond to the host's UIDs. (But perhaps this would be Hard.)

OK, I don't think there is currently a problem with mode 0600 files, but there are definitely a couple of similar issues.

One problem was that at one point in the initial development we had /tmp/crunch-job-work/ and /tmp/crunch-job-task-work/ as bind mounts, and there would be a UID mismatch inside and outside the container which prevented crunch-job from cleaning up, which is why those directories were moved inside the container.

A second issue is that we currently have to run the FUSE mount with --allow-other because the user id inside the container can be different. If the user id inside/outside the container is the same, we could tighten the FUSE permissions a bit.

Right now we're insensitive to the UID, but we require an account called "crunch". Losing the "account called crunch" part is good, but can we do it without introducing new arbitrary constraints?

The proposed change seems to unnecessarily leak worker node configuration into the container. Say my cluster gives the crunch user UID 1000, yours gives the crunch user UID 1001. I'll build my images with files owned by UID=1000, you'll build yours with files owned by UID=1001, and they will "work" but they won't be portable.

I'm not sure what you mean by "not portable". As long as those files are readable/executable by "other", the UID of the actual process shouldn't matter.

If we're going to require containers to depend on a numeric UID, perhaps we should admit that's what we're doing, pick one ("--user=4005"), and decree that the crunch account on worker nodes must have UID=4005. (Ideally we don't have to rely on this, but if we're embracing the idea that portability depends on UID conventions, we should admit it, and establish those conventions.)

No, ideally containers should be built to allow any UID to run processes inside them, not assuming some specific user (or worse, the root user).

If possible, it seems like we should let the user specify a UID -- and possibly opt out of the --user argument entirely and leave it up to the Dockerfile. Either way it would be under the user's control, and portable between clusters.

(Any sense of how much stuff would break if we just dropped the --user part completely, and relied on the Dockerfile to specify a user?)

Assuming that change is too breaking, is there any reason not to let the job submission specify a username/id, where something like "" means let the Dockerfile decide, and the default (null or omitted) means use "crunch"?

I suspect what will happen if we drop --user is that most containers will run with root as the default user, because people won't have changed the default user in their Dockerfiles and/or will have constructed the container manually using "commit". But if we intend to harden the Docker container by other means (e.g. --cap-drop) then maybe that's okay.

So as Adam would say, what is the net here?

  • Remove --user for everyone?
  • Remove --user when arvados_sdk_version isn't provided?
  • Introduce a knob to remove --user?

By "provisioning" I was just meant the "install Python SDK" part. So poor choice of name.

I would be fine with "no arvados sdk version specified" means "leave my container alone" mode. That's more or less how it used to work before the "pipe to perl to install arvados python sdk" behavior was added.

This sounds good, if we can make it work. I'd like to avoid adding another miscellaneous knob to runtime_constraints. "If you don't ask for an arvados_sdk_version, we don't rely on perl being in the container" sounds pretty good.

(Note the "pipe to perl to deposit the arvados python sdk" part doesn't change (and might not have to) because it delivers to the worker host, not the container. It's only the "activate" part that uses perl.)

Right, I got confused; that script does two different things in two different contexts.

The "run mode" of the install-and-run script looks tantalizingly close to a no-op if you assume we have a docker image (so there's no remove_tree needed to clean tmp) and there's no arvados_sdk_version to install.

It looks like we might be relying on the perl script to create /tmp/crunch-job-work/ and /tmp/crunch-job-task-work/, though, only because creating them as docker volumes makes them writable only by root. Maybe a docker feature has appeared to solve this? Assigning mode 777 would be enough...

As I said above, the reasoning for making the host and internal UID match up is that it makes it much easier to bind mount writable volumes without weird permission problems.

However, if we make "don't pipe to perl to run" an opt-in feature (rather than the default behavior when arvados_sdk_version isn't present) then we can just specify that /tmp/crunch-job-work/ and /tmp/crunch-job-task-work/ don't get created for you without breaking anything.

Apart from that, it seems like "using docker image, but not arvados_sdk_version" means we can skip the "perl -" part. Is this correct?

Yes.

#15 Updated by Peter Amstutz almost 4 years ago

Just to sift out from the mega comment above, 7582-run-any-docker-container is trying to accomplish the following three things:

  1. Don't require a "crunch" user inside the container to run
  2. Don't require "perl" inside the container to run
  3. Don't introduce potentially breaking changes to the behavior of existing crunch scripts

I am flexible on how we accomplish this technically, but I'd like to converge on a minimal solution as quickly as possible so that this can get merged and deployed, because it's a blocker for being able to run containers for CWL.

#16 Updated by Tom Clegg almost 4 years ago

Peter Amstutz wrote:

I'm not sure what you mean by "not portable". As long as those files are readable/executable by "other", the UID of the actual process shouldn't matter.

As long as all files are o+rwx (and no programs behave differently depending on whether they own the files) then jobs will be portable. But really, how likely is that?

Is there anything better about --user=`id -u`, compared to --user=4005?

No, ideally containers should be built to allow any UID to run processes inside them, not assuming some specific user (or worse, the root user).

Our goal here is to run arbitrary containers, not just "ideal" ones, right? "All non-zero UIDs are interchangeable" is a weird constraint. For example, it precludes normal things like "$HOME is owned by $USER", and even "$HOME is writable, but not world-writable".

I suspect what will happen if we drop --user is that most containers will run with root as the default user, because people won't have changed the default user in their Dockerfiles and/or will have constructed the container manually using "commit". But if we intend to harden the Docker container by other means (e.g. --cap-drop) then maybe that's okay.

So as Adam would say, what is the net here?

  • Remove --user for everyone?
  • Remove --user when arvados_sdk_version isn't provided?
  • Introduce a knob to remove --user?

Is there any reason not to let the job submission specify a username/id, where something like "" means let the Dockerfile decide, and the default (null or omitted) means use "crunch" (or "4005")?

This lets you do what "docker run" lets you do, which is probably a relevant benchmark.

As I said above, the reasoning for making the host and internal UID match up is that it makes it much easier to bind mount writable volumes without weird permission problems.

Unfortunately data volumes are owned by root, not the user who invoked docker, so matching UIDs don't help.

tom@host:~$ docker run -it -v /foobar --user `id -u` debian:8 ls -ld /foobar
drwxr-xr-x 2 root root 4096 Oct 20 15:40 /foobar

However, if we make "don't pipe to perl to run" an opt-in feature (rather than the default behavior when arvados_sdk_version isn't present) then we can just specify that /tmp/crunch-job-work/ and /tmp/crunch-job-task-work/ don't get created for you without breaking anything.

It seems a bit weird to tie those temp dirs to the arvados_sdk_version feature. It also sounds like it would break jobs that use the arvados SDK that's already installed in their docker image.

How about "sh -c 'mkdir /tmp/crunch-job-work /tmp/crunch-job-task-work; exec $command'"?

(Ideally we wouldn't intrude in /tmp at all, but given we're already bind mounting /tmp/crunch-job/ by this point, making two empty dirs seems benign.)

#17 Updated by Tom Clegg almost 4 years ago

(after some offline discussion)

UID in container

When using a docker command, add a step (after running docker_install_script) similar to the "sanity check" that does something along the lines of
  • (id -u; id -u crunch; id -u nobody) 2>/dev/null
  • Take the first non-zero number, and use that as the --user= argument.
  • If that doesn't produce any non-zero numbers, fail the job.
We get the following features:
  • Can't accidentally start running your jobs as root.
  • Existing jobs that assume --user=crunch (and don't specify a USER in their Dockerfile) continue to work.
For now, we accept the following limitations:
  • No way for user to override choice of user when submitting a job. If you want a user other than crunch or nobody, you need to put a USER command in your Dockerfile.
  • No --cap-drop yet.
  • Existing jobs that rely on us to say --user=crunch, but specify a different non-root USER in their Dockerfile, will fail. (We expect this is the empty set.)
  • Container could have some other program installed at /bin/id. (Future work: bind-mount our own statically linked binary when doing this, so we can get the username:id mapping without executing user-supplied code.)
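The UID-selection step above could be sketched as a shell fragment. This is a sketch only: `pick_uid` is a hypothetical helper, the real logic lives in crunch-job (Perl), and the candidate list here is simulated rather than read from a container.

```shell
# Sketch of the UID-selection logic described above (pick_uid is a
# hypothetical helper; the real implementation lives in crunch-job).
pick_uid() {
  for uid in "$@"; do
    case "$uid" in
      ''|0) continue ;;          # skip empty results and root
      *) echo "$uid"; return 0 ;;
    esac
  done
  return 1                       # no usable non-root UID: fail the job
}

# In crunch-job, the candidate list would come from the container:
#   docker run --rm "$IMAGE" sh -c '(id -u; id -u crunch; id -u nobody) 2>/dev/null'
# Here we simulate an image whose default user is root but which has a
# "crunch" user with UID 1000 and "nobody" with UID 65534:
if user=$(pick_uid 0 1000 65534); then
  echo "using --user=$user"      # prints "using --user=1000"
else
  echo "Docker image default user is root and has no non-root crunch/nobody user" >&2
fi
```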

No perl in container

If docker and no arvados_sdk_version, do not run "perl -" or send script on stdin. Instead, accomplish the same thing using a sh one-liner something like
  • sh -c 'mkdir /tmp/crunch-job-work /tmp/crunch-job-task-work; exec $ENV{CRUNCH_SRC}/crunch_scripts/$Job->{script}'
    
  • (but with appropriate quoting)

#18 Updated by Peter Amstutz almost 4 years ago

New 7582-run-any-docker-container branch is pushed.

#19 Updated by Tom Clegg almost 4 years ago

at 22c45ae

Since this is now the first time we try to start a container, it's going to be annoying if we can't tell the difference between "can't run container" and "default user is no good". As is, it looks like a range of docker/container failures will be reported as "Docker image default user is 'root'" (and any hints from docker's stderr will be sent to /dev/null) which will be confusing. I see that it's inconvenient to get the stdout with the tools we have in front of us. Maybe it would be enough to un-hide stderr and change the error message to say "...or there was a problem running `id` in the container (see above)"?

How about: docker run [...] id --user | grep -qxv 0

In any case a log message just before this will make any errors easier to follow, something like
  • Log(undef, "Looking up UID in container");

Since we're doing something a bit mysterious here (especially the "nobody" case, which will make stock images run very differently than they do anywhere else) we should also log what the eventual choice was. Technically you can find it later on in the "docker run" command line if you think of looking there, but an explicit "using --user=foo" or "using image's default user" would be much more friendly.

It looks like the "default user is non-root" case sets $dockeruser to "", but that counts as failure here -- should be if (!defined $dockeruser)?
  •   if (!$dockeruser) {
        croak("Docker image default user is 'root', and does not have a non-root 'crunch' or 'nobody' user.");
      }
    

$dockeruser should probably be called $dockeruserarg

(nit) you can nest "\Q\E" to do stuff like "sh -c \Qmkdir \Q$foo\E && exec \Q$bar\E\E", which might be easier / more thorough than doing the \'\" stuff manually. (The current version looks like it's fine, unless tmpdirs have quotation marks, which seems unlikely.)

#20 Updated by Tom Clegg almost 4 years ago

7582-crunch-runner @ 5fb633e

(first round of comments, haven't looked at all of upload*.go yet)

Move IArvadosClient from SDK to crunchrunner

Add signal handler test (maybe sigint a "sleep 10" command?) -- INT and KILL are the main ones we rely on day-to-day for cancel.

Add signal handler test for receiving a signal after process ended or assure me that cmd.Process will be nil after cmd.Run() returns so we don't send signals (it's possible Go does this for us, but I don't see it in the docs).

There's a possible race in "cmd.Process == nil" but it seems like it doesn't matter too much in practice. We already send many SIGINT from crunch-job, which would address the startup side, and it seems like the worst that can happen on the process-ending side is that we get a weird error when killing a job. So this is OK, right? (Is it worth a comment that we're deliberately accepting some harmless races?)

I think type TempFail struct{ InnerError error } can be just type TempFail struct{error} or even type TempFail error, and you can delete the (TempFail)Error() func.

Looks like tempfail isn't actually used, so status could reduce to a success bool:

-       const success = 1
-       const permfail = 2
-       const tempfail = 2
-       var status int
+       var success bool

Behavior seems arbitrary when the exit code is listed in two or more of the *Codes lists. I'm inclined to say this should be the "most failed" of the interpretations specified rather than the most successful, or we should just check for contradictions up front and permfail if we find any.

return TempFail{nil} seems sketchy. Since we only ever pass it to log.Print() I suppose it prints "nil" instead of crashing, but how about TempFail{fmt.Errorf("process exited %d", exitCode)}?
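The embedded-error form suggested above looks like this (a sketch: `classify` is a hypothetical helper, not the actual crunchrunner code):

```go
package main

import (
	"fmt"
	"log"
)

// TempFail marks an error as retryable. Embedding the error gives
// TempFail an Error() method for free, so no separate method is needed.
type TempFail struct{ error }

// classify is a hypothetical helper: treat any non-zero exit code as a
// temporary failure, wrapping a descriptive error as suggested above.
func classify(exitCode int) error {
	if exitCode == 0 {
		return nil
	}
	return TempFail{fmt.Errorf("process exited %d", exitCode)}
}

func main() {
	if err := classify(75); err != nil {
		if _, ok := err.(TempFail); ok {
			log.Print("temporary failure: ", err)
		}
	}
}
```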

Why umask?

gofmt

There's no need for stuff like log.Print("--TestUploadEmptyFile--"). Just run go test -check.vv, gocheck will print banners for you.

Test files & dirs with spaces, colons, newlines (perhaps cover all cases with a single file "./evil \n sub:dir/evil\n :file:name:" ...)

An opportunity:
  •         data := make([]byte, 1024*1024-1)
    -       for i := 0; i < 1024*1024-1; i++ {
    +       for i := range data {
    

Add to run-tests

#21 Updated by Peter Amstutz almost 4 years ago

Tom Clegg wrote:

7582-crunch-runner @ 5fb633e

(first round of comments, haven't looked at all of upload*.go yet)

Move IArvadosClient from SDK to crunchrunner

Done.

Add signal handler test (maybe sigint a "sleep 10" command?) -- INT and KILL are the main ones we rely on day-to-day for cancel.

Add signal handler test for receiving a signal after process ended or assure me that cmd.Process will be nil after cmd.Run() returns so we don't send signals (it's possible Go does this for us, but I don't see it in the docs).

There's a possible race in "cmd.Process == nil" but it seems like it doesn't matter too much in practice. We already send many SIGINT from crunch-job, which would address the startup side, and it seems like the worst that can happen on the process-ending side is that we get a weird error when killing a job. So this is OK, right? (Is it worth a comment that we're deliberately accepting some harmless races?)

Adjusted it so that it sets up the signal handlers after the process is started, and removes the signal handlers using signal.Stop() after the process completes.

I think type TempFail struct{ InnerError error } can be just type TempFail struct{error} or even type TempFail error, and you can delete the (TempFail)Error() func.

Done.

Looks like tempfail isn't actually used, so status could reduce to a success bool:
[...]

Done.

Behavior seems arbitrary when the exit code is listed in two or more of the *Codes lists. I'm inclined to say this should be the "most failed" of the interpretations specified rather than the most successful, or we should just check for contradictions up front and permfail if we find any.

Reordered the list to permFail, tempFail, success.

return TempFail{nil} seems sketchy. Since we only ever pass it to log.Print() I suppose it prints "nil" instead of crashing, but how about TempFail{fmt.Errorf("process exited %d", exitCode)}?

Done.

Why umask?

run-command does it, but I don't think the rationale is any more specific than "minimize who can read the files out of an abundance of caution".

gofmt

There's no need for stuff like log.Print("--TestUploadEmptyFile--"). Just run go test -check.vv, gocheck will print banners for you.

That's useful. Removed the banners.

Test files & dirs with spaces, colons, newlines (perhaps cover all cases with a single file "./evil \n sub:dir/evil\n :file:name:" ...)

Done.

An opportunity:
  • [...]

Done.

Add to run-tests

7582-crunch-runner in arvados-dev

#22 Updated by Peter Amstutz almost 4 years ago

Crunch runner initial documentation page: https://dev.arvados.org/projects/arvados/wiki/Crunch_runner

#23 Updated by Tom Clegg almost 4 years ago

7582-run-any-docker-container at dc3915c

I'm guessing this was meant to be

-$docker_bin run --rm --user=$try_user ...
+$docker_bin run --rm $try_user_arg ...

Other than that LGTM

#24 Updated by Peter Amstutz almost 4 years ago

Tom Clegg wrote:

7582-run-any-docker-container at dc3915c

I'm guessing this was meant to be
[...]

Other than that LGTM

Fixed; also removed "stdbuf" from the "minimal execution" mode. (stdbuf adjusts glibc's stdio buffering to turn off stdout buffering, which is helpful for logging, but the tool isn't guaranteed to be installed in the container.)

#25 Updated by Tom Clegg almost 4 years ago

7582-crunch-runner at 5d3de21

Need to check whether err returned by cmd.Start() is nil

(*ManifestStreamWriter)Write() should probably return a non-nil error in case anyone tries to use it. Or... might as well implement it,

return m.ReadFrom(bytes.NewReader(p))

How about

-    for true {
-        select {
-        case block, valid := <-uploader:
-            if !valid {
-                finish <- errors
-                return
-            }
-            ...
-        }
-    }
+    for block := range uploader {
+        ...
+    }
+    finish <- errors

Godoc says a successful io.Copy returns err==nil, not EOF, so it seems like we shouldn't ignore err==EOF in WalkFunc.

It looks like symlinked files are included in the output. Is this intentional? (Should document on the wiki page.) If so, is it worth paying special attention to them in the "(%v bytes)" log?

if count > 0 { looks superfluous in (*ManifestStreamWriter)ReadFrom() -- that's the only way m.Block.offset can have reached BLOCKSIZE.

In Finish(), please unpyramid as if v.upload == nil { continue } ...and "v" should probably be "s" for "stream"?

In Finish(), if errors != nil { looks superfluous (the range block will just run zero times if the slice is nil).

Peter Amstutz wrote:

Adjusted it so that it sets up the signal handlers after the process is started, and removes the signal handlers using signal.Stop() after the process completes.

Hm. I'm having second thoughts about whether it's OK to have a window (between starting the child and deciding to forward signals) where we don't propagate signals. Would that leave our docker container running with a zombie process, or something? (I assume so, because otherwise we wouldn't need the signal handling at all, right?)

I think doing things in this order would make us safe from startup races:
  1. make channel and call signal.Notify
  2. start the process (if we get a signal right now, it will wait in the channel until we're ready to forward it)
  3. start the goroutine that listens for signals and forwards them

Why umask?

run-command does it, but I don't think the rationale is any more specific than "minimize who can read the files out of an abundance of caution".

Generally "leave things alone" seems to be our goal here, but AFAICT the default starting umask in docker seems to be "whatever the docker daemon's umask was on the host", which isn't predictable/portable, and predictable seems better than not.

The only situation I can think of where this would make a difference (given we expect everything in the container to run as the same user) is a job that writes some files and then makes a tarball of them in the output dir. The tarball would then have unconventionally strict permissions. And that doesn't sound so bad.

Should probably document umask on the wiki page.

Rest LGTM

#26 Updated by Tom Clegg almost 4 years ago

Peter Amstutz wrote:

Fixed; also removed "stdbuf" from the "minimal execution" mode. (stdbuf adjusts glibc's stdio buffering to turn off stdout buffering, which is helpful for logging, but the tool isn't guaranteed to be installed in the container.)

Commit message says stdbuf "solves problem of stdout showing up out of order and with incorrect time stamps in logs."

Any idea whether running stdbuf outside the container has the same effect?

#27 Updated by Peter Amstutz almost 4 years ago

Tom Clegg wrote:

Peter Amstutz wrote:

Fixed; also removed "stdbuf" from the "minimal execution" mode. (stdbuf adjusts glibc's stdio buffering to turn off stdout buffering, which is helpful for logging, but the tool isn't guaranteed to be installed in the container.)

Commit message says stdbuf "solves problem of stdout showing up out of order and with incorrect time stamps in logs."

Any idea whether running stdbuf outside the container has the same effect?

Details:

stdbuf modifies the behavior of glibc FILE*; it's entirely a userspace concept. Right now we're setting --output=0 --error=0 (no buffering) but actually it would be better to use --output=L --error=L (line buffering, since it's going to be line buffered by crunchstat and crunch-job anyway).

From looking at the coreutils source, the approach is that stdbuf sets LD_PRELOAD to libstdbuf.so and then executes the subprocess with _STDBUF_E, _STDBUF_I and _STDBUF_O in the environment. libstdbuf.so has an ELF "constructor" function which actually modifies the buffering mode during program initialization.

$ stdbuf -o0 env
_STDBUF_O=0
LD_PRELOAD=/usr/lib/coreutils/libstdbuf.so

_STDBUF_E, _STDBUF_I and _STDBUF_O are specific to stdbuf and not recognized by glibc generally.

This means stdbuf settings can propagate down to subprocesses, provided they don't clear the environment, but they won't (and shouldn't) propagate into the Docker environment unless we add --env. Even then, that only works if libstdbuf.so is inside the container, and if that is the case, the stdbuf program itself is almost certainly present as well.

We could enhance the minimal /bin/sh bootstrap to use `which stdbuf`.

An alternate approach is to allocate a PTY with TERM=dumb, which modifies the buffering (I think); however, some programs change their behavior based on whether they are printing to a PTY or a pipe (for example, whether they print live status bars).

Found this gem in stdbuf.c:

      switch (c)
        {
        /* Old McDonald had a farm ei...  */
        case 'e':
        case 'i':
        case 'o':

#28 Updated by Peter Amstutz almost 4 years ago

Tom Clegg wrote:

7582-crunch-runner at 5d3de21

Need to check whether err returned by cmd.Start() is nil

(*ManifestStreamWriter)Write() should probably return a non-nil error in case anyone tries to use it. Or... might as well implement it, [...]

Didn't realize it was that easy :-) Done.

How about
[...]

Godoc says a successful io.Copy returns err==nil, not EOF, so it seems like we shouldn't ignore err==EOF in WalkFunc.

That was my mistake, I was allowing EOF to propagate from ReadFrom() which was incorrect.

It looks like symlinked files are included in the output. Is this intentional? (Should document on the wiki page.) If so, is it worth paying special attention to them in the "(%v bytes)" log?

That's correct in that it is intentional, but not ideal. The Python code does some fancy footwork to identify symlinks to Keep-mounted files and merge them directly into the manifest, but it relies on infrastructure in the Python SDK that we don't have in the Go SDK, so porting that logic to Go would require writing a lot more code.

if count > 0 { looks superfluous in (*ManifestStreamWriter)ReadFrom() -- that's the only way m.Block.offset can have reached BLOCKSIZE.

That's right, fixed.

In Finish(), please unpyramid as if v.upload == nil { continue } ...and "v" should probably be "s" for "stream"?

Done.

In Finish(), if errors != nil { looks superfluous (the range block will just run zero times if the slice is nil).

Done.

Peter Amstutz wrote:

Adjusted it so that it sets up the signal handlers after the process is started, and removes the signal handlers using signal.Stop() after the process completes.

Hm. I'm having second thoughts about whether it's OK to have a window (between starting the child and deciding to forward signals) where we don't propagate signals. Would that leave our docker container running with a zombie process, or something? (I assume so, because otherwise we wouldn't need the signal handling at all, right?)

I think doing things in this order would make us safe from startup races:
  1. make channel and call signal.Notify
  2. start the process (if we get a signal right now, it will wait in the channel until we're ready to forward it)
  3. start the goroutine that listens for signals and forwards them

The sequence is now:

  1. Create command object
  2. Set up signal notify channel
  3. Start process
  4. Start goroutine
  5. Goroutine forwards signals to process and records the last signal that was received (if any)
  6. Wait for process to terminate
  7. Stop signal notify channel
  8. Check if any signals were received and PermFail if so.

Why umask?

run-command does it, but I don't think the rationale is any more specific than "minimize who can read the files out of an abundance of caution".

Generally "leave things alone" seems to be our goal here, but AFAICT the default starting umask in docker seems to be "whatever the docker daemon's umask was on the host", which isn't predictable/portable, and predictable seems better than not.

The only situation I can think of where this would make a difference (given we expect everything in the container to run as the same user) is a job that writes some files and then makes a tarball of them in the output dir. The tarball would then have unconventionally strict permissions. And that doesn't sound so bad.

Should probably document umask on the wiki page.

I reintroduced umask (I had removed it) and set it to 0022.

On further consideration, umask 0077 only provides isolation if we also run containers with guaranteed-distinct host UIDs, which would reopen the discussion about whether to use UIDs that don't correspond to users inside the container. Since we're not currently doing that, we might as well go with the least surprising behavior.

#29 Updated by Tom Clegg almost 4 years ago

7582-crunch-runner at 0fa7cd5

nit: update comment in setupSignals now that it doesn't forward signals any more (and maybe move comment above function per godoc convention)

defer signal.Stop(sigChan) won't happen until runner() returns. It looks like the intent is to Stop() between Wait() and if caughtSignal != nil. It seems the worst thing that can happen is a more ungraceful crash when we get SIGINT during WriteTree, so perhaps this isn't a big deal? OTOH if we do want to be sure we've consumed sigChan before checking caughtSignal, we could do something like
  • done := make(chan struct{}) ...above goroutine
  • close(done) ...in the goroutine, after range is done
  • signal.Stop(sigChan); close(sigChan); <-done ...after Wait()

Rest of the changes LGTM

#30 Updated by Peter Amstutz almost 4 years ago

Tom Clegg wrote:

7582-crunch-runner at 0fa7cd5

nit: update comment in setupSignals now that it doesn't forward signals any more (and maybe move comment above function per godoc convention)

Fixed.

defer signal.Stop(sigChan) won't happen until runner() returns. It looks like the intent is to Stop() between Wait() and if caughtSignal != nil.

That's correct, I was just confused about the scoping of defer.

It seems the worst thing that can happen is a more ungraceful crash when we get SIGINT during WriteTree, so perhaps this isn't a big deal?

The intent is to be able to still cancel the process during WriteTree(), if for some reason the upload gets stuck.

OTOH if we do want to be sure we've consumed sigChan before checking caughtSignal, we could do something like

  • done := make(chan struct{}) ...above goroutine
  • close(done) ...in the goroutine, after range is done
  • signal.Stop(sigChan); close(sigChan); <-done ...after Wait()

Done.

Rest of the changes LGTM

#31 Updated by Tom Clegg almost 4 years ago

LGTM @ 50484a7

#32 Updated by Peter Amstutz almost 4 years ago

  • Status changed from In Progress to Resolved
