Idea #7582
[CWL] binary run-command shim for CWL (Closed)
Description
CWL jobs are allowed to specify arbitrary Docker containers. In particular, this means that depending on our existing run-command script (which depends on the Arvados Python SDK, which itself has a long chain of further dependencies) is a non-starter, because we can't assume a friendly environment inside the container.
Implement a simplified run-command (crunch-runner?) in Go that is compiled to a binary with minimum external dependencies that can be loaded into arbitrary containers. This will act as a shim to run the command line generated by CWL and communicate the results to Arvados.
Required features:
- Run arbitrary program in a subprocess
- Set environment variables for subprocess
- Redirect stdin/stdout
- Optionally symlink input files to output directory
- Determine success/failure based on exit code
- Upload contents of output directory to collection
- Communicate task success/failure to API server along with output collection.
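For illustration, here is a minimal Go sketch of that core flow, using hypothetical names (runTask, stdout.txt) rather than the eventual crunchrunner code: run the command in a subprocess with extra environment variables, redirect stdout into the output directory, and treat a non-zero exit code as failure. Uploading the output directory and reporting the result to the API server would follow.

// Minimal sketch of the shim's core loop (hypothetical names, not the
// final crunchrunner code): run a subprocess with extra environment
// variables, redirect stdout into the output directory, and decide
// success from the exit code.
package main

import (
    "log"
    "os"
    "os/exec"
    "path/filepath"
)

func runTask(outdir string, env map[string]string, args ...string) error {
    cmd := exec.Command(args[0], args[1:]...)
    cmd.Env = os.Environ()
    for k, v := range env {
        cmd.Env = append(cmd.Env, k+"="+v)
    }
    stdout, err := os.Create(filepath.Join(outdir, "stdout.txt"))
    if err != nil {
        return err
    }
    defer stdout.Close()
    cmd.Stdout = stdout
    cmd.Stderr = os.Stderr
    // cmd.Run returns an *exec.ExitError for a non-zero exit code.
    return cmd.Run()
}

func main() {
    env := map[string]string{"TASK_WORK": "/tmp/task_work"}
    if err := runTask("/tmp/outdir", env, "sh", "-c", "echo hello"); err != nil {
        log.Fatal("task failed: ", err)
    }
    log.Print("task succeeded; uploading /tmp/outdir would happen here")
}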
Updated by Peter Amstutz about 9 years ago
- Subject changed from [CWL] binary run-command shim to [CWL] binary run-command shim for CWL
Updated by Brett Smith about 9 years ago
- Target version set to 2015-10-28 sprint
Updated by Peter Amstutz about 9 years ago
- Status changed from New to In Progress
- Story points changed from 3.0 to 2.0
Updated by Peter Amstutz about 9 years ago
Turns out this does require a few tweaks to crunch-job, but not major surgery.
Branch 7582-run-any-docker-container is up for review with the following changes:
- Don't use --user=crunch, because (a) there might not be a user called crunch in the container and (b) the crunch user could be aliased to root, which would be a terrible security bug. This fixes it to use the UID of the calling process on the compute node for the user inside the container.
- Add "no_provision" to runtime constraints; when enabled, it skips the "perl - ..." provisioning and runs the supplied crunch script directly.
Updated by Tom Clegg about 9 years ago
Peter Amstutz wrote:
This fixes it to use the UID of the calling process on the compute node for the user inside the container.
This means the contained process always runs with whatever UID the "crunch" user has on the worker node. This means either:
- containers must be built so they can run jobs as any UID, or
- containers must be built so they can run jobs using a standard UID, say 1000, and all worker nodes must give the "crunch" user that same numeric UID.
In the former case, we could just choose UID 1000 and it would work.
In the latter case, we could just say --user=1000 and not be sensitive to the UID assignments on the worker node.
BTW, I don't see how this change fixes any security holes. Giving passwordless sudo to a user with UID=1000 is not significantly harder than giving the crunch user UID=0...?
Seems like we should just omit the --user argument entirely if we think we're in "leave my container alone" mode. (I suppose a lot of jobs would break if we started treating "no arvados sdk version specified" as "leave my container alone" mode...?)
- Add "no_provision" to runtime constraints, when enabled it skips the "perl - ..." provisioning and runs the supplied crunch script directly.
What's the intent behind this change? It seems to be a misnomer, in that it doesn't seem to skip the provisioning, and in fact still relies on the provisioning -- it just skips the wrapper around the task process. Is the idea here to drop the container-must-have-perl dependency?
Updated by Peter Amstutz about 9 years ago
Tom Clegg wrote:
Peter Amstutz wrote:
This fixes it to use the UID of the calling process on the compute node for the user inside the container.
This means the contained process always runs with whatever UID the "crunch" user has on the worker node. This means either:
- containers must be built so they can run jobs as any UID, or
- containers must be built so they can run jobs using a standard UID, say 1000, and all worker nodes must give the "crunch" user that same numeric UID.
In the former case, we could just choose UID 1000 and it would work.
In the latter case, we could just say --user=1000 and not be sensitive to the UID assignments on the worker node.
BTW, I don't see how this change fixes any security holes. Giving passwordless sudo to a user with UID=1000 is not significantly harder than giving the crunch user UID=0...?
Seems like we should just omit the --user argument entirely if we think we're in "leave my container alone" mode. (I suppose a lot of jobs would break if we started treating "no arvados sdk version specified" as "leave my container alone" mode...?)
The working assumption is that "root inside the container" == "root outside the container" (i.e. at least in older versions of Docker, my understanding is that the root user is insufficiently contained, such that it can break out of the cgroup namespacing either by exploit or by design and thus become root on the host). Supposedly Docker 1.8 has better privilege separation, but I would want to do more research.
Since we do actually allow strangers on the Internet to run Docker containers on our compute nodes, it seems much better that the security baseline is that the process inside the container isn't a privileged one. This was the intention behind adding --user=crunch in the first place. The fact that --user=crunch doesn't actually protect you because crunch could be an alias for root is a bug, not a feature.
The reason for using the calling process UID specifically (and not some default or random UID) is (a) to avoid UID collisions and (b) so that filesystem UIDs on bind mounts match up with the host process. For example, if we run as random user id 31337 but the bind mounts are owned by UID 1000 with mode 0600, then the user inside the container won't be able to read them.
It would actually be even better if the process UID inside the container was a different unprivileged user which didn't have access to sudo or docker on the host, but that's probably a bit more complicated to rig up.
- containers must be built so they can run jobs as any UID
I think that is a reasonable expectation; that's what we try to do now, and it seems to work in practice.
- Add "no_provision" to runtime constraints, when enabled it skips the "perl - ..." provisioning and runs the supplied crunch script directly.
What's the intent behind this change? It seems to be a misnomer, in that it doesn't seem to skip the provisioning, and in fact still relies on the provisioning -- it just skips the wrapper around the task process.
By "provisioning" I was just meant the "install Python SDK" part. So poor choice of name.
I would be fine with "no arvados sdk version specified" means "leave my container alone" mode. That's more or less how it used to work before the "pipe to perl to install arvados python sdk" behavior was added.
Is the idea here to drop the container-must-have-perl dependency?
Yes. As it turns out, even if the container does have Perl, one of the module imports in the bootstrap script (I think "use File::Path") fails because it's not available on a stock Ubuntu 12.04 image. But it is much better not to depend on perl at all.
Updated by Tom Clegg about 9 years ago
Peter Amstutz wrote:
Since we do actually allow strangers on the Internet to run Docker containers on our compute nodes, it seems much better that the security baseline is that the process inside the container isn't a privileged one. This was the intention behind adding --user=crunch in the first place. The fact that --user=crunch doesn't actually protect you because crunch could be an alias for root is a bug, not a feature.
Let's be clear: the "--user=crunch" argument has never been a security device, and a "--user=1000" feature would not be a security device either. The security problem is: root in container can get root on host + container can give root to whoever it wants + user can run arbitrary container = user can get root on host. Changing --user=crunch to --user=1000 eliminates one but leaves other trivial ways of getting root in a container of one's own design, so the security value seems to be zero on its own. Anyway, it's fine not to fix the security problem right here, but we should avoid having illusions.
If we combine this with docker --cap-drop then we might get somewhere... of course it would be ideal if it were safe to allow root in the container, but that might(?) be further off.
The reason for using the calling process UID specifically (and not some default or random UID) is (a) to avoid UID collisions and (b) so that filesystem UIDs on bind mounts match up with the host process. For example, if we run as random user id 31337 but the bind mounts are owned by UID 1000 with mode 0600, then the user inside the container won't be able to read them.
Is there some situation where this 0600 issue happens? It seems to me we might be better off making bind-mounted files appear as UID=0 GID=0 inside the container: that way, container behavior wouldn't be sensitive to how well the container's UIDs correspond to the host's UIDs. (But perhaps this would be Hard.)
It would actually be even better if the process UID inside the container was a different unprivileged user which didn't have access to sudo or docker on the host, but that's probably a bit more complicated to rig up.
No matter what UID you choose, can't a purpose-built container trivially let that "unprivileged" process become root inside the container using setuid, and then escalate from there to root@host?
- containers must be built so they can run jobs as any UID
I think that is a reasonable expectation; that's what we try to do now, and it seems to work in practice.
Right now we're insensitive to the UID, but we require an account called "crunch". Losing the "account called crunch" part is good, but can we do it without introducing new arbitrary constraints?
The proposed change seems to unnecessarily leak worker node configuration into the container. Say my cluster gives the crunch user UID 1000, yours gives the crunch user UID 1001. I'll build my images with files owned by UID=1000, you'll build yours with files owned by UID=1001, and they will "work" but they won't be portable.
If we're going to require containers to depend on a numeric UID, perhaps we should admit that's what we're doing, pick one ("--user=4005"), and decree that the crunch account on worker nodes must have UID=4005. (Ideally we don't have to rely on this, but if we're embracing the idea that portability depends on UID conventions, we should admit it, and establish those conventions.)
If possible, it seems like we should let the user specify a UID -- and possibly opt out of the --user argument entirely and leave it up to the Dockerfile. Either way it would be under the user's control, and portable between clusters.
(Any sense of how much stuff would break if we just dropped the --user part completely, and relied on the Dockerfile to specify a user?)
Assuming that change is too breaking, is there any reason not to let the job submission specify a username/id, where something like "" means let the Dockerfile decide, and the default (null or omitted) means use "crunch" (or "4005")?
By "provisioning" I was just meant the "install Python SDK" part. So poor choice of name.
I would be fine with "no arvados sdk version specified" means "leave my container alone" mode. That's more or less how it used to work before the "pipe to perl to install arvados python sdk" behavior was added.
This sounds good, if we can make it work. I'd like to avoid adding another miscellaneous knob to runtime_constraints. "If you don't ask for an arvados_sdk_version, we don't rely on perl being in the container" sounds pretty good.
(Note the "pipe to perl to deposit the arvados python sdk" part doesn't change (and might not have to) because it delivers to the worker host, not the container. It's only the "activate" part that uses perl.)
The "run mode" of the install-and-run script looks tantalizingly close to a no-op if you assume we have a docker image (so there's no remove_tree needed to clean tmp) and there's no arvados_sdk_version to install.
It looks like we might be relying on the perl script to create /tmp/crunch-job-work/ and /tmp/crunch-job-task-work/, though, only because creating them as docker volumes makes them writable only by root. Maybe a docker feature has appeared to solve this? Assigning mode 777 would be enough... Or, I suppose it shouldn't be too hard to write a mkdir && exec script that can be fed to sh instead of perl.
Apart from that, it seems like "using docker image, but not arvados_sdk_version" means we can skip the "perl -" part. Is this correct?
Is the idea here to drop the container-must-have-perl dependency?
Yes. As it turns out, even if the container does have Perl, one of the module imports in the bootstrap script (I think "use File::Path") fails because it's not available on a stock Ubuntu 12.04 image. But it is much better not to depend on perl at all.
OK, this helps.
Updated by Peter Amstutz about 9 years ago
Tom Clegg wrote:
Peter Amstutz wrote:
Let's be clear: the "--user=crunch" argument has never been a security device, and a "--user=1000" feature would not be a security device either. The security problem is: root in container can get root on host + container can give root to whoever it wants + user can run arbitrary container = user can get root on host. Changing --user=crunch to --user=1000 eliminates one but leaves other trivial ways of getting root in a container of one's own design, so the security value seems to be zero on its own. Anyway, it's fine not to fix the security problem right here, but we should avoid having illusions.
If we combine this with docker --cap-drop then we might get somewhere... of course it would be ideal if it were safe to allow root in the container, but that might(?) be further off.
Here's a good article on Docker security:
https://opensource.com/business/14/9/security-for-docker
So, you're right, I thought Docker disallowed the suid bit on binaries by default, but on further research that's not true. We probably should add --cap-drop setuid --cap-drop setgid (or even see if things work with --cap-drop all), but I agree that is out of scope for this story.
Is there some situation where this 0600 issue happens? It seems to me we might be better off making bind-mounted files appear as UID=0 GID=0 inside the container: that way, container behavior wouldn't be sensitive to how well the container's UIDs correspond to the host's UIDs. (But perhaps this would be Hard.)
OK, I don't think there is currently a problem with mode 0600 files, but there are definitely a couple of similar issues.
One problem was that at one point in the initial development we had /tmp/crunch-job-work/ and /tmp/crunch-job-task-work/ as bind mounts, and there would be a UID mismatch inside and outside the container which prevented crunch-job from cleaning up, which is why those directories were moved inside the container.
A second issue is that we currently have to run the FUSE mount with --allow-other because the user id inside the container can be different. If the user id inside/outside the container is the same, we could tighten the FUSE permissions a bit.
Right now we're insensitive to the UID, but we require an account called "crunch". Losing the "account called crunch" part is good, but can we do it without introducing new arbitrary constraints?
The proposed change seems to unnecessarily leak worker node configuration into the container. Say my cluster gives the crunch user UID 1000, yours gives the crunch user UID 1001. I'll build my images with files owned by UID=1000, you'll build yours with files owned by UID=1001, and they will "work" but they won't be portable.
I'm not sure what you mean by "not portable". As long as those files are readable/executable by "other", then the UID of the actual process shouldn't matter.
If we're going to require containers to depend on a numeric UID, perhaps we should admit that's what we're doing, pick one ("--user=4005"), and decree that the crunch account on worker nodes must have UID=4005. (Ideally we don't have to rely on this, but if we're embracing the idea that portability depends on UID conventions, we should admit it, and establish those conventions.)
No, ideally the containers should be built to allow any UID to run processes inside the container and not assume some specific user (or worse, the root user).
If possible, it seems like we should let the user specify a UID -- and possibly opt out of the --user argument entirely and leave it up to the Dockerfile. Either way it would be under the user's control, and portable between clusters.
(Any sense of how much stuff would break if we just dropped the --user part completely, and relied on the Dockerfile to specify a user?) Assuming that change is too breaking, is there any reason not to let the job submission specify a username/id, where something like "" means let the Dockerfile decide, and the default (null or omitted) means use "crunch"?
I suspect what will happen if we drop --user is that most containers will run with root as the default user, because people won't have changed the default user in their Dockerfiles and/or constructed the container manually using "commit". But if we intend to harden the Docker container by other means (e.g. --cap-drop) then maybe that's okay.
So as Adam would say, what is the net here?
- Remove --user for everyone?
- Remove --user when arvados_sdk_version isn't provided?
- Introduce a knob to remove --user?
By "provisioning" I was just meant the "install Python SDK" part. So poor choice of name.
I would be fine with "no arvados sdk version specified" means "leave my container alone" mode. That's more or less how it used to work before the "pipe to perl to install arvados python sdk" behavior was added.
This sounds good, if we can make it work. I'd like to avoid adding another miscellaneous knob to runtime_constraints. "If you don't ask for an arvados_sdk_version, we don't rely on perl being in the container" sounds pretty good.
(Note the "pipe to perl to deposit the arvados python sdk" part doesn't change (and might not have to) because it delivers to the worker host, not the container. It's only the "activate" part that uses perl.)
Right, I got confused, that script does two different things in two different contexts.
The "run mode" of the install-and-run script looks tantalizingly close to a no-op if you assume we have a docker image (so there's no remove_tree needed to clean tmp) and there's no arvados_sdk_version to install.
It looks like we might be relying on the perl script to create /tmp/crunch-job-work/ and /tmp/crunch-job-task-work/, though, only because creating them as docker volumes makes them writable only by root. Maybe a docker feature has appeared to solve this? Assigning mode 777 would be enough...
As I said above, the reasoning for making the host and internal UID match up is that it makes it much easier to bind mount writable volumes without weird permission problems.
However, if we make "don't pipe to perl to run" an opt-in feature (rather than the default behavior when arvados_sdk_version isn't present) then we can just specify that /tmp/crunch-job-work/ and /tmp/crunch-job-task-work/ don't get created for you without breaking anything.
Apart from that, it seems like "using docker image, but not arvados_sdk_version" means we can skip the "perl -" part. Is this correct?
Yes.
Updated by Peter Amstutz about 9 years ago
Just to sift out from the mega comment above, 7582-run-any-docker-container is trying to accomplish the following three things:
- Don't require a "crunch" user inside the container to run
- Don't require "perl" inside the container to run
- Don't introduce potentially breaking changes to the behavior of existing crunch scripts
I am flexible on how we accomplish this technically, but I'd like to converge on a minimal solution as quickly as possible so that this can get merged and deployed, because it's a blocker for being able to run containers for CWL.
Updated by Tom Clegg about 9 years ago
Peter Amstutz wrote:
I'm not sure what you mean by "not portable". As long as those files are readable/executable by "other", then the UID of the actual process shouldn't matter.
As long as all files are o+rwx (and no programs behave differently depending on whether they own the files) then jobs will be portable. But really, how likely is that?
Is there anything better about --user=`id -u`, compared to --user=4005?
No, ideally the containers should be built to allow any UID to run processes inside the container and not assume some specific user (or worse, the root user).
Our goal here is to run arbitrary containers, not just "ideal" ones, right? "All non-zero UIDs are interchangeable" is a weird constraint. For example, it precludes normal things like "$HOME is owned by $USER", and even "$HOME is writable, but not world-writable".
I suspect what will happen if we drop --user is that most containers will run with root as the default user, because people won't have changed the default user in their Dockerfiles and/or constructed the container manually using "commit". But if we intend to harden the Docker container by other means (e.g. --cap-drop) then maybe that's okay.
So as Adam would say, what is the net here?
- Remove --user for everyone?
- Remove --user when arvados_sdk_version isn't provided?
- Introduce a knob to remove --user?
Is there any reason not to let the job submission specify a username/id, where something like "" means let the Dockerfile decide, and the default (null or omitted) means use "crunch" (or "4005")?
This lets you do what "docker run" lets you do, which is probably a relevant benchmark.
As I said above, the reasoning for making the host and internal UID match up is that it makes it much easier to bind mount writable volumes without weird permission problems.
Unfortunately data volumes are owned by root, not the user who invoked docker, so matching UIDs don't help.
tom@host:~$ docker run -it -v /foobar --user `id -u` debian:8 ls -ld /foobar
drwxr-xr-x 2 root root 4096 Oct 20 15:40 /foobar
However, if we make "don't pipe to perl to run" an opt-in feature (rather than the default behavior when arvados_sdk_version isn't present) then we can just specify that /tmp/crunch-job-work/ and /tmp/crunch-job-task-work/ don't get created for you without breaking anything.
It seems a bit weird to tie those temp dirs to the arvados_sdk_version feature. It also sounds like it would break jobs that use the arvados SDK that's already installed in their docker image.
How about "sh -c 'mkdir /tmp/crunch-job-work /tmp/crunch-job-task-work; exec $command'"
?
(Ideally we wouldn't intrude in /tmp at all, but given we're already bind mounting /tmp/crunch-job/ by this point, making two empty dirs seems benign.)
Updated by Tom Clegg about 9 years ago
(after some offline discussion)
UID in container
When using a docker command, add a step (after running docker_install_script) similar to the "sanity check" that does something along the lines of
(id -u; id -u crunch; id -u nobody) 2>/dev/null
- Take the first non-zero number, and use that as the --user= argument.
- If that doesn't produce any non-zero numbers, fail the job.
- Can't accidentally start running your jobs as root.
- Existing jobs that assume --user=crunch (and don't specify a USER in their Dockerfile) continue to work.
- No way for user to override choice of user when submitting a job. If you want a user other than crunch or nobody, you need to put a USER command in your Dockerfile.
- No --cap-drop yet.
- Existing jobs that rely on us to say --user=crunch, but specify a different non-root USER in their Dockerfile, will fail. (We expect this is the empty set.)
- Container could have some other program installed at /bin/id. (Future work: bind-mount our own statically linked binary when doing this, so we can get the username:id mapping without executing user-supplied code.)
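For clarity, the selection rule sketched in Go (the actual implementation lives in crunch-job, which is Perl; the helper below is purely illustrative):

// Illustrative only: "take the first non-zero UID reported by
// `(id -u; id -u crunch; id -u nobody)`, otherwise fail". The real
// logic lives in crunch-job (Perl); names here are hypothetical.
package main

import (
    "fmt"
    "strconv"
    "strings"
)

func pickUserArg(idOutput string) (string, error) {
    for _, line := range strings.Split(strings.TrimSpace(idOutput), "\n") {
        uid, err := strconv.Atoi(strings.TrimSpace(line))
        if err == nil && uid != 0 {
            return fmt.Sprintf("--user=%d", uid), nil
        }
    }
    return "", fmt.Errorf("no non-root user available in container")
}

func main() {
    // e.g. the image's default user is root, but it has crunch=1000 and nobody=65534
    arg, err := pickUserArg("0\n1000\n65534")
    fmt.Println(arg, err) // --user=1000 <nil>
}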
No perl in container
If docker and no arvados_sdk_version, do not run "perl -" or send the script on stdin. Instead, accomplish the same thing using a sh one-liner, something like
sh -c 'mkdir /tmp/crunch-job-work /tmp/crunch-job-task-work; exec $ENV{CRUNCH_SRC}/crunch_scripts/$Job->{script}'
- (but with appropriate quoting)
Updated by Peter Amstutz about 9 years ago
New 7582-run-any-docker-container branch is pushed.
Updated by Tom Clegg about 9 years ago
at 22c45ae
Since this is now the first time we try to start a container, it's going to be annoying if we can't tell the difference between "can't run container" and "default user is no good". As is, it looks like a range of docker/container failures will be reported as "Docker image default user is 'root'" (and any hints from docker's stderr will be sent to /dev/null) which will be confusing. I see that it's inconvenient to get the stdout with the tools we have in front of us. Maybe it would be enough to un-hide stderr and change the error message to say "...or there was a problem running `id` in the container (see above)"?
How about: docker run [...] id --user | grep -qxv 0
Log(undef, "Looking up UID in container");
Since we're doing something a bit mysterious here (especially the "nobody" case, which will make stock images run very differently than they do anywhere else) we should also log what the eventual choice was. Technically you can find it later on in the "docker run" command line if you think of looking there, but an explicit "using --user=foo" or "using image's default user" would be much more friendly.
It looks like the "default user is non-root" case sets$dockeruser
to ""
, but that counts as failure here -- should be if (!defined $dockeruser)
?
if (!$dockeruser) { croak("Docker image default user is 'root', and does not have a non-root 'crunch' or 'nobody' user."); }
$dockeruser should probably be called $dockeruserarg
(nit) you can nest "\Q\E" to do stuff like "sh -c \Qmkdir \Q$foo\E && exec \Q$bar\E\E", which might be easier / more thorough than doing the \'\" stuff manually. (The current version looks like it's fine, unless tmpdirs have quotation marks, which seems unlikely.)
Updated by Tom Clegg about 9 years ago
7582-crunch-runner @ 5fb633e
(first round of comments, haven't looked at all of upload*.go yet)
Move IArvadosClient from SDK to crunchrunner
Add signal handler test (maybe sigint a "sleep 10" command?) -- INT and KILL are the main ones we rely on day-to-day for cancel.
Add signal handler test for receiving a signal after process ended or assure me that cmd.Process will be nil after cmd.Run() returns so we don't send signals (it's possible Go does this for us, but I don't see it in the docs).
There's a possible race in "cmd.Process == nil" but it seems like it doesn't matter too much in practice. We already send many SIGINT from crunch-job, which would address the startup side, and it seems like the worst that can happen on the process-ending side is that we get a weird error when killing a job. So this is OK, right? (Is it worth a comment that we're deliberately accepting some harmless races?)
I think type TempFail struct{ InnerError error } can be just type TempFail struct{error} or even type TempFail error, and you can delete the (TempFail)Error() func.
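A small sketch of the embedded-error version (assumed, not the branch's exact code): embedding the error means TempFail still satisfies the error interface with no hand-written Error() method, and type assertions still distinguish the failure kinds.

package main

import (
    "fmt"
    "log"
)

// TempFail and PermFail embed error, so both satisfy the error interface
// without a hand-written Error() method. (Sketch only; names assumed.)
type TempFail struct{ error }
type PermFail struct{ error }

func classify(exitCode int) error {
    if exitCode == 0 {
        return nil
    }
    if exitCode == 75 { // hypothetical "temporary" code, like EX_TEMPFAIL
        return TempFail{fmt.Errorf("process exited %d", exitCode)}
    }
    return PermFail{fmt.Errorf("process exited %d", exitCode)}
}

func main() {
    err := classify(75)
    log.Print(err) // "process exited 75"
    _, temporary := err.(TempFail)
    log.Print("temporary failure: ", temporary)
}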
Looks like tempfail isn't actually used, so status could reduce to a success bool:
- const success = 1
- const permfail = 2
- const tempfail = 2
- var status int
+ var success bool
Behavior seems arbitrary when the exit code is listed in two or more of the *Codes lists. I'm inclined to say this should be the "most failed" of the interpretations specified rather than the most successful, or we should just check for contradictions up front and permfail if we find any.
return TempFail{nil} seems sketchy. Since we only ever pass it to log.Print() I suppose it prints "nil" instead of crashing, but how about TempFail{fmt.Errorf("process exited %d", exitCode)}?
Why umask?
gofmt
There's no need for stuff like log.Print("--TestUploadEmptyFile--"). Just run go test -check.vv, gocheck will print banners for you.
Test files & dirs with spaces, colons, newlines (perhaps cover all cases with a single file "./evil \n sub:dir/evil\n :file:name:" ...)
An opportunity:
  data := make([]byte, 1024*1024-1)
- for i := 0; i < 1024*1024-1; i++ {
+ for i := range data {
Add to run-tests
Updated by Peter Amstutz about 9 years ago
Tom Clegg wrote:
7582-crunch-runner @ 5fb633e
(first round of comments, haven't looked at all of upload*.go yet)
Move IArvadosClient from SDK to crunchrunner
Done.
Add signal handler test (maybe sigint a "sleep 10" command?) -- INT and KILL are the main ones we rely on day-to-day for cancel.
Add signal handler test for receiving a signal after process ended or assure me that cmd.Process will be nil after cmd.Run() returns so we don't send signals (it's possible Go does this for us, but I don't see it in the docs).
There's a possible race in "cmd.Process == nil" but it seems like it doesn't matter too much in practice. We already send many SIGINT from crunch-job, which would address the startup side, and it seems like the worst that can happen on the process-ending side is that we get a weird error when killing a job. So this is OK, right? (Is it worth a comment that we're deliberately accepting some harmless races?)
Adjusted it so that it sets up the signal handlers after the process is started, and removes the signal handlers using signal.Stop() after the process completes.
I think type TempFail struct{ InnerError error } can be just type TempFail struct{error} or even type TempFail error, and you can delete the (TempFail)Error() func.
Done.
Looks like tempfail isn't actually used, so status could reduce to a success bool:
[...]
Done.
Behavior seems arbitrary when the exit code is listed in two or more of the *Codes lists. I'm inclined to say this should be the "most failed" of the interpretations specified rather than the most successful, or we should just check for contradictions up front and permfail if we find any.
Reordered the list to permFail, tempFail, success.
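A sketch of how that ordering resolves contradictions (the list and variable names are assumptions, not the branch's exact code): checking the permanent-failure codes first means the "most failed" interpretation wins when an exit code appears in more than one list.

package main

import "fmt"

func contains(codes []int, c int) bool {
    for _, v := range codes {
        if v == c {
            return true
        }
    }
    return false
}

// outcome checks permFail first, then tempFail, then success, so a code
// listed in more than one list gets the more-failed interpretation.
func outcome(exitCode int, permFailCodes, tempFailCodes, successCodes []int) string {
    switch {
    case contains(permFailCodes, exitCode):
        return "permfail"
    case contains(tempFailCodes, exitCode):
        return "tempfail"
    case exitCode == 0 || contains(successCodes, exitCode):
        return "success"
    default:
        return "permfail"
    }
}

func main() {
    // 1 appears in both failure lists; the permfail reading wins.
    fmt.Println(outcome(1, []int{1}, []int{1, 75}, nil))
}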
return TempFail{nil} seems sketchy. Since we only ever pass it to log.Print() I suppose it prints "nil" instead of crashing, but how about TempFail{fmt.Errorf("process exited %d", exitCode)}?
Done.
Why umask?
run-command does it, but I don't think the rationale is any more specific than "minimize who can read the files out of an abundance of caution".
gofmt
There's no need for stuff like log.Print("--TestUploadEmptyFile--"). Just run go test -check.vv, gocheck will print banners for you.
That's useful. Removed the banners.
Test files & dirs with spaces, colons, newlines (perhaps cover all cases with a single file "./evil \n sub:dir/evil\n :file:name:" ...)
Done.
An opportunity:
- [...]
Done.
Add to run-tests
7582-crunch-runner in arvados-dev
Updated by Peter Amstutz about 9 years ago
Crunch runner initial documentation page: https://dev.arvados.org/projects/arvados/wiki/Crunch_runner
Updated by Tom Clegg about 9 years ago
7582-run-any-docker-container at dc3915c
I'm guessing this was meant to be
-$docker_bin run --rm --user=$try_user ...
+$docker_bin run --rm $try_user_arg ...
Other than that LGTM
Updated by Peter Amstutz about 9 years ago
Tom Clegg wrote:
7582-run-any-docker-container at dc3915c
I'm guessing this was meant to be
[...]
Other than that LGTM
Fixed, also removed "stdbuf" from the "minimal execution" mode. (This adjusts glibc to turn off stdout buffering, which is helpful for logging, but not guaranteed to be installed).
Updated by Tom Clegg about 9 years ago
7582-crunch-runner at 5d3de21
Need to check whether err returned by cmd.Start() is nil
(*ManifestStreamWriter)Write() should probably return a non-nil error in case anyone tries to use it. Or... might as well implement it:
return m.ReadFrom(bytes.NewReader(p))
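Spelled out, assuming the standard io.Writer and io.ReaderFrom signatures (the stand-in type below is not the real ManifestStreamWriter, which packs data into Keep blocks rather than buffering):

package main

import (
    "bytes"
    "fmt"
    "io"
)

// Stand-in type to keep the example self-contained; the real
// ManifestStreamWriter has more fields and packs blocks for Keep.
type ManifestStreamWriter struct {
    buf bytes.Buffer
}

func (m *ManifestStreamWriter) ReadFrom(r io.Reader) (int64, error) {
    return io.Copy(&m.buf, r)
}

// Write delegates to ReadFrom, so both entry points share one code path.
func (m *ManifestStreamWriter) Write(p []byte) (int, error) {
    n, err := m.ReadFrom(bytes.NewReader(p))
    return int(n), err
}

func main() {
    m := &ManifestStreamWriter{}
    fmt.Fprintln(m, "hello") // works because Write is implemented
    fmt.Print(m.buf.String())
}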
How about
- for true {
-   select {
-   case block, valid := <-uploader:
-     if !valid {
-       finish <- errors
-       return
-     }
-     ...
-   }
- }
+ for block := range uploader {
+   ...
+ }
+ finish <- errors
Godoc says a successful io.Copy returns err==nil, not EOF, so it seems like we shouldn't ignore err==EOF in WalkFunc.
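To illustrate the point, a hypothetical WalkFunc (not the branch's code): since io.Copy returns err == nil on success and never io.EOF, its error can be returned as-is, with no special case for EOF.

package main

import (
    "fmt"
    "io"
    "os"
    "path/filepath"
)

func uploadTree(root string, dst io.Writer) error {
    return filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
        if err != nil || info.IsDir() {
            return err
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        _, err = io.Copy(dst, f) // nil on success; never io.EOF
        return err
    })
}

func main() {
    if err := uploadTree("/tmp/outdir", io.Discard); err != nil {
        fmt.Println("walk failed:", err)
    }
}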
It looks like symlinked files are included in the output. Is this intentional? (Should document on the wiki page.) If so, is it worth paying special attention to them in the "(%v bytes)" log?
if count > 0 { looks superfluous in (*ManifestStreamWriter)ReadFrom() -- that's the only way m.Block.offset can have reached BLOCKSIZE.
In Finish(), please unpyramid as if v.upload == nil { continue }
...and "v" should probably be "s" for "stream"?
In Finish(), if errors != nil { looks superfluous (the range block will just run zero times if the slice is nil).
Peter Amstutz wrote:
Adjusted it so that it sets up the signal handlers after the process is started, and removes the signal handlers using signal.Stop() after the process completes.
Hm. I'm having second thoughts about whether it's OK to have a window (between starting the child and deciding to forward signals) where we don't propagate signals. Would that leave our docker container running with a zombie process, or something? (I assume so, because otherwise we wouldn't need the signal handling at all, right?)
I think doing things in this order would make us safe from startup races:
- make channel and call signal.Notify
- start the process (if we get a signal right now, it will wait in the channel until we're ready to forward it)
- start the goroutine that listens for signals and forwards them
Why umask?
run-command does it, but I don't think the rationale is any more specific than "minimize who can read the files out of an abundance of caution".
Generally "leave things alone" seems to be our goal here, but AFAICT the default starting umask in docker seems to be "whatever the docker daemon's umask was on the host", which isn't predictable/portable, and predictable seems better than not.
The only situation I can think of where this would make a difference (given we expect everything in the container to run as the same user) is a job that writes some files and then makes a tarball of them in the output dir. The tarball would then have unconventionally strict permissions. And that doesn't sound so bad.
Should probably document umask on the wiki page.
Rest LGTM
Updated by Tom Clegg about 9 years ago
Peter Amstutz wrote:
Fixed, also removed "stdbuf" from the "minimal execution" mode. (This adjusts glibc to turn off stdout buffering, which is helpful for logging, but not guaranteed to be installed).
Commit message says stdbuf "solves problem of stdout showing up out of order and with incorrect time stamps in logs."
Any idea whether running stdbuf outside the container has the same effect?
Updated by Peter Amstutz about 9 years ago
Tom Clegg wrote:
Peter Amstutz wrote:
Fixed, also removed "stdbuf" from the "minimal execution" mode. (This adjusts glibc to turn off stdout buffering, which is helpful for logging, but not guaranteed to be installed).
Commit message says stdbuf "solves problem of stdout showing up out of order and with incorrect time stamps in logs."
Any idea whether running stdbuf outside the container has the same effect?
Details:
stdbuf modifies the behavior of glibc FILE*; it's entirely a userspace concept. Right now we're setting --output=0 --error=0 (no buffering), but actually it would be better to use --output=L --error=L (line buffering, since it's going to be line buffered by crunchstat and crunch-job anyway).
From looking at coreutils source, the approach seems to be that it sets LD_PRELOAD to libstdbuf.so and then executes the subprocess with _STDBUF_E, _STDBUF_I and _STDBUF_O in the environment. libstdbuf.so has a ELF "constructor" function which actually modifies the buffering mode during program initialization.
$ stdbuf -o0 env
_STDBUF_O=0
LD_PRELOAD=/usr/lib/coreutils/libstdbuf.so
_STDBUF_E, _STDBUF_I and _STDBUF_O are specific to stdbuf and not recognized by glibc generally.
This means stdbuf settings can propagate down to subprocesses, provided they don't clear the environment, but won't (and shouldn't) propagate into the Docker environment unless we add --env, and that only works if we know that libstdbuf.so is inside the container (but if that is the case, the stdbuf program is almost certainly already present).
We could enhance the minimal /bin/sh bootstrap to use `which stdbuf`.
An alternate approach is to allocate a PTY with TERM=dumb, which modifies the buffering (I think), however some programs modify their behavior based on whether they are printing to PTY or a pipe (for example whether they print live status bars.)
Found this gem in stdbuf.c:
switch (c)
  {
  /* Old McDonald had a farm ei... */
  case 'e':
  case 'i':
  case 'o':
Updated by Peter Amstutz about 9 years ago
Tom Clegg wrote:
7582-crunch-runner at 5d3de21
Need to check whether err returned by cmd.Start() is nil
(*ManifestStreamWriter)Write() should probably return a non-nil error in case anyone tries to use it. Or... might as well implement it, [...]
Didn't realize it was that easy :-) Done.
How about
[...]
Godoc says a successful io.Copy returns err==nil, not EOF, so it seems like we shouldn't ignore err==EOF in WalkFunc.
That was my mistake, I was allowing EOF to propagate from ReadFrom() which was incorrect.
It looks like symlinked files are included in the output. Is this intentional? (Should document on the wiki page.) If so, is it worth paying special attention to them in the "(%v bytes)" log?
That's correct, in that it is intentional but not ideal. The Python code does some fancy footwork to identify symlinks to Keep-mounted files and merge them directly into the manifest, but it relies on infrastructure we have in the Python SDK that we don't have in the Go SDK, so porting the logic to Go would require writing a lot more code.
if count > 0 { looks superfluous in (*ManifestStreamWriter)ReadFrom() -- that's the only way m.Block.offset can have reached BLOCKSIZE.
That's right, fixed.
In Finish(), please unpyramid as if v.upload == nil { continue } ...and "v" should probably be "s" for "stream"?
Done.
In Finish(), if errors != nil { looks superfluous (the range block will just run zero times if the slice is nil).
Done.
Peter Amstutz wrote:
Adjusted it so that it sets up the signal handlers after the process is started, and removes the signal handlers using signal.Stop() after the process completes.
Hm. I'm having second thoughts about whether it's OK to have a window (between starting the child and deciding to forward signals) where we don't propagate signals. Would that leave our docker container running with a zombie process, or something? (I assume so, because otherwise we wouldn't need the signal handling at all, right?)
I think doing things in this order would make us safe from startup races:
- make channel and call signal.Notify
- start the process (if we get a signal right now, it will wait in the channel until we're ready to forward it)
- start the goroutine that listens for signals and forwards them
The sequence is now:
- Create command object
- Set up signal notify channel
- Start process
- Start goroutine
- Goroutine forwards signals to process and records the last signal that was received (if any)
- Wait for process to terminate
- Stop signal notify channel
- Check if any signals were received and PermFail if so.
Why umask?
run-command does it, but I don't think the rationale is any more specific than "minimize who can read the files out of an abundance of caution".
Generally "leave things alone" seems to be our goal here, but AFAICT the default starting umask in docker seems to be "whatever the docker daemon's umask was on the host", which isn't predictable/portable, and predictable seems better than not.
The only situation I can think of where this would make a difference (given we expect everything in the container to run as the same user) is a job that writes some files and then makes a tarball of them in the output dir. The tarball would then have unconventionally strict permissions. And that doesn't sound so bad.
Should probably document umask on the wiki page.
I reintroduced umask (I had removed it) and set it to 0022.
On further consideration, umask 0077 only provides isolation if we also run containers with guaranteed-distinct host UIDs, which would reopen the discussion about using UIDs that don't correspond to users inside the container. Since we're currently not doing that, we might as well go with the least surprising behavior.
Updated by Tom Clegg about 9 years ago
7582-crunch-runner at 0fa7cd5
nit: update comment in setupSignals now that it doesn't forward signals any more (and maybe move comment above function per godoc convention)
defer signal.Stop(sigChan) won't happen until runner() returns. It looks like the intent is to Stop() between Wait() and if caughtSignal != nil. It seems the worst thing that can happen is a more ungraceful crash when we get SIGINT during WriteTree, so perhaps this isn't a big deal? OTOH if we do want to be sure we've consumed sigChan before checking caughtSignal, we could do something like the following (a fuller sketch appears after the list):
- done := make(chan struct{}) ...above the goroutine
- close(done) ...in the goroutine, after the range is done
- signal.Stop(sigChan); close(sigChan); <-done ...after Wait()
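A minimal, self-contained sketch of that shutdown sequence (assumed ordering per the discussion above, not the exact crunchrunner code): register the handler before starting the child, forward and record signals in a goroutine, and after Wait() stop and drain the channel before checking what was caught.

package main

import (
    "log"
    "os"
    "os/exec"
    "os/signal"
    "syscall"
)

func main() {
    cmd := exec.Command("sleep", "10")

    // Register before Start so an early signal waits in the channel.
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)

    if err := cmd.Start(); err != nil {
        log.Fatal(err)
    }

    var caughtSignal os.Signal
    done := make(chan struct{})
    go func() {
        for sig := range sigChan {
            caughtSignal = sig
            cmd.Process.Signal(sig) // forward to the child
        }
        close(done) // sigChan drained; safe to read caughtSignal
    }()

    waitErr := cmd.Wait()

    signal.Stop(sigChan)
    close(sigChan)
    <-done

    if caughtSignal != nil {
        log.Printf("killed by signal %v: treating as permanent failure", caughtSignal)
    } else if waitErr != nil {
        log.Printf("process failed: %v", waitErr)
    }
}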
Rest of the changes LGTM
Updated by Peter Amstutz about 9 years ago
Tom Clegg wrote:
7582-crunch-runner at 0fa7cd5
nit: update comment in setupSignals now that it doesn't forward signals any more (and maybe move comment above function per godoc convention)
Fixed.
defer signal.Stop(sigChan) won't happen until runner() returns. It looks like the intent is to Stop() between Wait() and if caughtSignal != nil.
That's correct, I was just confused about the scoping of defer.
It seems the worst thing that can happen is a more ungraceful crash when we get SIGINT during WriteTree, so perhaps this isn't a big deal?
The intent is to be able to still cancel the process during WriteTree(), if for some reason the upload gets stuck.
OTOH if we do want to be sure we've consumed sigChan before checking caughtSignal, we could do something like
- done := make(chan struct{}) ...above the goroutine
- close(done) ...in the goroutine, after the range is done
- signal.Stop(sigChan); close(sigChan); <-done ...after Wait()
Done.
Rest of the changes LGTM
Updated by Peter Amstutz about 9 years ago
- Status changed from In Progress to Resolved