Feature #4027 (Closed)
[Crunch] Accept SDK version as a runtime constraint. Install SDK into the docker container before running tasks.
Added by Ward Vandewege about 10 years ago. Updated about 10 years ago.
Description
During jobs.create and jobs.update, runtime_constraints['arvados_sdk_version'] provided by the client should be resolved to a commit hash and stored in arvados_sdk_hash.
Example:
In the jobs.create request:
  "runtime_constraints":{ "arvados_sdk_version":"master" }
Stored in the job record:
  "arvados_sdk_hash":"7b2d04380952ac79453bd0771679e40c81281f5c"
  "runtime_constraints":{ "arvados_sdk_version":"master" }
Before running tasks, Crunch should do the equivalent of:
  cd {path-to-arvados-tree}
  git reset --hard {arvados_sdk_hash}
  cd sdk
  ./install.sh
The sdk/install.sh program should install Python, Ruby, and Perl client libraries to the system package directories where they can be imported by Crunch tasks.
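To make the resolution step concrete, here is a minimal sketch in Ruby, assuming a local bare mirror of the arvados repository; the method name and repository path are illustrative, not the actual API server code:

  # Resolve a client-supplied version name (branch, tag, or hash) to a
  # full commit hash by asking git directly.
  def resolve_sdk_version(repo_path, version_name)
    hash = IO.popen(["git", "--git-dir", repo_path, "rev-parse",
                     "--verify", "#{version_name}^{commit}"]) do |io|
      io.read.strip
    end
    raise ArgumentError, "no such version: #{version_name}" unless $?.success?
    hash
  end

  # resolve_sdk_version("/var/lib/arvados/arvados.git", "master")
  #   => "7b2d04380952ac79453bd0771679e40c81281f5c"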
Updated by Ward Vandewege about 10 years ago
- Subject changed from Add sdk_version as a job parameter. Takes a git hash. Crunch should install that sdk version into the docker image before running the task. to [Crunch] Add sdk_version as a job parameter. Takes a git hash. Crunch should install that sdk version into the docker image before running the task.
Updated by Tom Clegg about 10 years ago
- Subject changed from [Crunch] Add sdk_version as a job parameter. Takes a git hash. Crunch should install that sdk version into the docker image before running the task. to [Crunch] Accept SDK version as a runtime constraint. Install SDK into the docker container before running tasks.
- Description updated (diff)
Updated by Ward Vandewege about 10 years ago
- Target version changed from Arvados Future Sprints to 2014-10-29 sprint
Updated by Tim Pierce about 10 years ago
- Category set to Crunch
- Assigned To set to Tim Pierce
Updated by Tim Pierce about 10 years ago
- Target version changed from 2014-10-29 sprint to Arvados Future Sprints
Updated by Ward Vandewege about 10 years ago
- Target version changed from Arvados Future Sprints to 2014-11-19 sprint
Updated by Brett Smith about 10 years ago
- Assigned To changed from Tim Pierce to Brett Smith
Proposed implementation: after the normal Docker image install, use docker run to install the exact specified version in the downloaded image, then run the job in the generated image.
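A hedged sketch of that flow in Ruby (the image name, install path, and commands are illustrative assumptions, not the final implementation):

  # Run the SDK install step inside the downloaded image, commit the result
  # as a new image, then launch the job's tasks from that image.
  base_image = "arvados/jobs"
  install_cmd = "cd /usr/local/arvados/sdk && ./install.sh"  # hypothetical path
  container_id = IO.popen(["docker", "run", "-d", base_image,
                           "/bin/sh", "-c", install_cmd]) { |io| io.read.strip }
  system("docker", "wait", container_id)
  job_image = IO.popen(["docker", "commit", container_id]) { |io| io.read.strip }
  # ...tasks then run under: docker run #{job_image} ...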
Updated by Brett Smith about 10 years ago
4027-api-sdk-version-wip is up for review. It implements the API server changes necessary to make it go. Once it's merged, I'll work on Crunch to finish the story.
One small discrepancy with the description: I named the new column arvados_sdk_version instead of arvados_sdk_hash. I didn't want to commit us for eternity to identifying SDK versions by commit hashes, so I made the name more generic. (It would be pretty great if we outlived git!)
Updated by Tom Clegg about 10 years ago
At db8f547...
The only complaint I can find is that the arvados repository fixture ("owned by All Users group") doesn't quite match what we suggest in the install docs ("owned by root, readable by All Users group"). (The latter sounds better to me.) Possible remedies:
- Make the fixture match the docs (less scope creep, probably better for now)
- Create it automatically at runtime instead, like system_user et al., and perhaps clone the repo automatically during install (maybe a better long-term plan?)
class AddArvadosSdkVersionToJobs < ActiveRecord::Migration
  def change
    add_column :jobs, :arvados_sdk_version, :string
  end
end
Nice fixes in git_test_helper. (Looks like someone didn't know about tar -C!)
Updated by Brett Smith about 10 years ago
Tom Clegg wrote:
At db8f547...
The only complaint I can find is that the arvados repository fixture ("owned by All Users group") doesn't quite match what we suggest in the install docs ("owned by root, readable by All Users group").
I looked at 4xphq, where you own the arvados repo, and figured it didn't matter too much. :) But I'm happy to follow best practices. Updated it to be owned by root and added a can_read permission.
Mostly out of curiosity: is the long form migration better?
Don't read too much into it. At one point, I had trouble rolling back the short form migration, but that was a little while ago and I can't rule out the possibility that I had another problem. At this point I'm admittedly copying the last migration I did and adapting it as necessary.
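For reference, a hedged illustration of the two migration styles (assuming "long form" means explicit up/down methods rather than the single reversible change method; both go inside the migration class):

  # Short, reversible form: Rails infers the rollback from add_column.
  def change
    add_column :jobs, :arvados_sdk_version, :string
  end

  # Long form: the rollback behavior is spelled out explicitly.
  def up
    add_column :jobs, :arvados_sdk_version, :string
  end

  def down
    remove_column :jobs, :arvados_sdk_version
  end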
Now at be3c438
Updated by Tom Clegg about 10 years ago
Brett Smith wrote:
Don't read too much into it. At one point, I had trouble rolling back the short form migration, but that was a little while ago and I can't rule out the possibility that I had another problem. At this point I'm admittedly copying the last migration I did and adapting it as necessary.
Gotcha. In case it's helpful next time: "bundle exec rails g migration AddArvadosSdkVersionToJobs arvados_sdk_version:string" should produce a reasonable migration, possibly quicker than copy & paste.
(Next we need "rails g fixture"...)
Now at be3c438
LGTM, thanks.
Updated by Peter Amstutz about 10 years ago
In order to support Common Workflow Language tool descriptions, it will be essential to be able to take Docker images containing applications and configure them to run under Arvados. This means there is no assumption that mount points like /keep exist, or that there is a user named "crunch". These things need to be provisioned before the container can run crunch scripts.
Updated by Radhika Chippada about 10 years ago
Review feedback:
- I could not tell if such a test already exists; if not, does it make sense to add a test that fails on job->find_arvados_sdk_version by passing an invalid / incorrect sdk version?
- Wondering if we could add (probably an integration) test that creates a job with a specific version, read it and verify version, update it to a different version, read back and verify? It appeared that the functional tests were doing one step per test (as they should be).
- Does it make sense to add a job fixture with a specific version and test being able to read it and update it? I am thinking this would be similar to reading a job that was previously created with an SDK version.
- Does it make sense to add a workbench integration test to verify the sdk version of this job fixture?
- Not sure if it would be possible / meaningful to add tests for the update for the method "find_docker_image_locator"?
- This is a really small issue, but I was wondering if we really need 255 bytes for the newly added job attribute (sdk version)?
- Everything else looked good. Thanks.
Updated by Tom Clegg about 10 years ago
4027-sdk-debian-packages-wip @ commit:ab8b92d
This variable could probably be renamed to PATCH_VERSION:
-GIT_HASH=`git log --first-parent --max-count=1 --format=format:$HUMAN_READABLE_TIMESTAMP.%h -n1 .`
+GIT_HASH=$(version_from_git)
The rest LGTM, with the caveat that I haven't actually tried running it yet to convince myself the Perl package builds...
Updated by Brett Smith about 10 years ago
Tom Clegg wrote:
This variable could probably be renamed to PATCH_VERSION:
You're proving why it should be renamed: it used to be PATCH_VERSION, but now it's the entire package version. Renamed to PKG_VERSION and pushed at adb6ea8. Thanks.
Updated by Brett Smith about 10 years ago
Radhika,
Thanks for the feedback. I believe many of the tests you're requesting are covered by unit tests in test/unit/job_test.rb. Details below.
Radhika Chippada wrote:
Review feedback:
- I could not tell if such a test already exists; if not, does it make sense to add a test that fails on job->find_arvados_sdk_version by passing an invalid / incorrect sdk version?
This should be covered by "creating job with SDK version '#{search}'" when search == "__nonexistent__".
- Wondering if we could add (probably an integration) test that creates a job with a specific version, read it and verify version, update it to a different version, read back and verify? It appeared that the functional tests were doing one step per test (as they should be).
Great suggestion. Added as a unit test in fc1a628.
- Does it make sense to add a job fixture with a specific version and test being able to read it and update it? I am thinking this would be the similar to reading a job that was previously created with a sdk version.
After implementing the test in the last suggestion, I'm not sure what additional Arvados code this would test. If I wrote the test right, I don't think there should be much difference between "create a job with an SDK constraint, update the constraint, and check the update" and "load a fixture with an SDK constraint, update the constraint, and check the update"—it would just check that Rails loads fixtures correctly.
- Does it make sense to add a workbench integration test to verify the sdk version of this job fixture?
No Workbench functionality has been added in this ticket yet, so I'm not sure what it would test. I think the only place it would appear is in the job's "API response," which I believe is already tested generally.
- Not sure if it would be possible / meaningful to add tests for the update for the method "find_docker_image_locator"?
This should already be covered by the various Docker-related tests in job_test.rb. This method has existed for a while now; this branch just refactored it so that it could share common code with find_arvados_sdk_version.
- This is really small issue, but was wondering if we really need 255 bytes for the newly added job attribute (sdk version)?
It's a fair point, but that has already been pushed and deployed on 4xphq. Unfortunately I think trying to roll it back now would be more trouble than it's worth.
Ready for another look at fc1a628. Thanks.
Updated by Radhika Chippada about 10 years ago
Brett, sorry for the review confusion. I did start with reviewing the correct branch but made a mistake in the end and ended up reviewing the wrong branch. So now I have reviewed the sdk branch as well. Just a few comments below, none major. You can address them or not, and merge without waiting on me any longer. Thanks.
- In job_controller -> load_job_specific_filters -> @filters.select! do |(col, op, val)|
Elsewhere in our code we used the variable name "attr" instead of "col". Wondering if you would like to name it attr?
- Regarding the comment "# Using the HEAD default is fine." in the rescue block. I am sure you made sure this works :). Just making sure in this case it would default to HEAD?
- I think "previous_job_run_with_arvados_sdk_version" fixture can simply be "job_with_arvados_sdk_version"
- In JobReuseControllerTest, I am wondering if the variable names filter_h and filter_a can be more descriptive. I am thinking these stand for filter_hash and filter_added or something?
- And you already added the extra test that I suggested. Thanks.
Updated by Brett Smith about 10 years ago
Radhika Chippada wrote:
Brett, sorry for the review confusion. I did start with reviewing the correct branch but made a mistake in the end and ended up reviewing the wrong branch.
Radhika,
No worries about that at all. I know this ticket has been confusing with multiple branches (there will be four by the time this is all over), and I'm happy to have the extra pair of eyes for the first one. Thanks for following up.
- In job_controller -> load_job_specific_filters -> @filters.select! do |(col, op, val)|
Elsewhere in our code we used the variable name "attr" instead of "col". Wondering if you would like to name it attr?
You're absolutely right. I have changed all of these variable names to match what we use in related loops like ApiClientAuthorizationsController: attr, operator, operand.
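For illustration, the renamed destructuring would read like this (the block body here is a placeholder, not the actual filter logic):

  @filters.select! do |(attr, operator, operand)|
    # Placeholder: keep only filters the generic query code can handle.
    !%w(docker_image_locator arvados_sdk_version).include?(attr)
  end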
- Regarding the comment "# Using the HEAD default is fine." in the rescue block. I am sure you made sure this works :). Just making sure in this case it would default to HEAD?
Yes, this is a little Ruby magic: the git_filters hash has a default value that's set the first time any item is accessed, and that sets "max_version" => "HEAD". Refer to the git_filters = Hash.new … line early in the method; that's where the default value is defined. I updated the comment to try to clarify what's going on here; let me know if that's helpful.
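A minimal sketch of that idiom (the values are illustrative, not the exact crunch code): the block passed to Hash.new runs the first time a missing key is read, so the defaults appear lazily.

  git_filters = Hash.new do |hash, key|
    hash[key] = {"max_version" => "HEAD", "exclude_versions" => []}
  end

  git_filters["arvados_sdk_version"]["max_version"]  # => "HEAD"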
- I think "previous_job_run_with_arvados_sdk_version" fixture can simply be "job_with_arvados_sdk_version"
We use the "previous_job_run" pattern for all the jobs that are in the completed state, which are primarily tested by the job reuse tests. This fixture is in the same boat, so I think the consistency is helpful, even if it is a little long.
- In JobReuseControllerTest, I am wondering if the variable names filter_h and filter_a can be more descriptive. I am thinking these stand for filter_hash and filter_added or something?
I meant filter_hash and filter_array, but you're right, this is a little too brief. Changed to filters_hash and filters (since the array corresponds to what literally gets passed in to the API request), respectively.
Ready for another look at bc47a4a. Thanks.
Updated by Brett Smith about 10 years ago
- Target version changed from 2014-11-19 sprint to Arvados Future Sprints
Updated by Brett Smith about 10 years ago
- Target version changed from Arvados Future Sprints to 2014-11-19 sprint
Updated by Brett Smith about 10 years ago
- Target version changed from 2014-11-19 sprint to 2014-12-10 sprint
Updated by Tom Clegg about 10 years ago
- Status changed from New to In Progress
- Target version changed from 2014-12-10 sprint to 2014-11-19 sprint
Updated by Brett Smith about 10 years ago
- Target version changed from 2014-11-19 sprint to 2014-12-10 sprint
Updated by Radhika Chippada about 10 years ago
Review feedback for branch:
- Just one question. Should this test name in "unit/job_test.rb" say "without Docker image" instead of "with Docker image"?
test "can't specify SDK version with Docker image" do
- Everything else looks good. Thanks.
Updated by Peter Amstutz about 10 years ago
Reviewing 4027-crunch-sdk-install-wip 0e9c824
- Thank you for refactoring the increasingly monolithic start_jobs method.
- Since you're already refactoring the git code in crunch-dispatch, it would be better to run git with popen so we don't have to shell escape everything (see the sketch after this list).
- A style tweak would make this logic clearer on line 271 of crunch-dispatch.rb (I realize I probably wrote this code originally ;-): change
  src_repo = File.join(@repo_root, repo_name + '.git')
  to
  src_repo = File.join(@repo_root, "#{repo_name}.git")
- In crunch-job, prefer "source" to "." on line 1825 for readability.
- I get this error:
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr crunchstat: Running [/usr/bin/docker.io run --rm=true --attach=stdout --attach=stderr --attach=stdin -i --user=crunch --cidfile=/tmp/crunch-job-1001/4n8aq-ot0gb-i4f91shujv1gsk2.cid --sig-proxy --dns 10.13.4.125 --dns 172.17.42.1 --volume=/tmp/crunch-job-1001/src:/tmp/crunch-job-1001/src:ro --volume=/tmp/crunch-job-1001/opt:/tmp/crunch-job-1001/opt:ro --volume=/tmp/crunch-job-1001/task/localhost.1.keep:/keep:ro --volume=/tmp --env=JOB_PARAMETER_COMMAND=ARRAY(0x32c2cd8) --env=CRUNCH_WORK=/tmp/crunch-job-1001/work --env=TASK_QSEQUENCE=0 --env=CRUNCH_SRC_COMMIT=b77fb69914fda0aedecddf2e62076cb404265647 --env=ARVADOS_API_TOKEN=3ehs221j7jcdw7igofn0vztkwhblir8lqd4wsomn6ds05d9iz3 --env=TASK_UUID=4n8aq-ot0gb-i4f91shujv1gsk2 --env=CRUNCH_INSTALL=/tmp/crunch-job-1001/opt --env=CRUNCH_JOB_UUID=4n8aq-8i9sb-i4rk2oysvvc9mik --env=CRUNCH_NODE_SLOTS=4 --env=ARVADOS_API_HOST=petere1:3001 --env=CRUNCH_SRC=/tmp/crunch-job-1001/src --env=TASK_TMPDIR=/tmp/crunch-job-task-work/localhost.1 --env=JOB_SCRIPT=run-command --env=CRUNCH_TMP=/tmp/crunch-job-1001 --env=TASK_KEEPMOUNT=/keep --env=JOB_UUID=4n8aq-8i9sb-i4rk2oysvvc9mik --env=TASK_SLOT_NODE=localhost --env=ARVADOS_API_HOST_INSECURE=true --env=JOB_WORK=/tmp/crunch-job-work --env=TASK_WORK=/tmp/crunch-job-task-work/localhost.1 --env=CRUNCH_SRC_URL=/home/peter/work/_arvados_internal.git --env=TASK_SEQUENCE=0 --env=TASK_SLOT_NUMBER=1 --env=CRUNCH_REFRESH_TRIGGER=/tmp/crunch_refresh_trigger --env=CRUNCH_JOB_BIN=/home/peter/work/arvados/services/crunch/crunch-job --env=HOME=/tmp/crunch-job-task-work/localhost.1 777ef687a8811f22fdd7c615be9356a92ce5f2150ff481bd368def31eae1bc15 stdbuf --output=0 --error=0 perl - /tmp/crunch-job-1001/src/crunch_scripts/run-command]
2014-11-26 14:42:38 -0500 -- pipeline_instance 4n8aq-d1hrv-cmildf07v92ddee command 4n8aq-8i9sb-i4rk2oysvvc9mik {:failed=>0, :done=>0, :running=>1, :todo=>0}
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr crunchstat: reading stats from /sys/fs/cgroup/memory/memory.stat
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr crunchstat: mem 5393547264 cache 0 swap 8059 pgmajfault 4882997248 rss
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuacct/cpuacct.stat
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuset/cpuset.cpus
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr crunchstat: cpu 245540.6500 user 12745.2200 sys 4 cpus
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr crunchstat: reading stats from /sys/fs/cgroup/blkio/blkio.io_service_bytes
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr crunchstat: blkio:8:32 15360 write 79734784 read
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr crunchstat: blkio:8:16 270336 write 72303616 read
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr crunchstat: blkio:8:0 188539445248 write 4412491776 read
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuacct/cgroup.procs
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr crunchstat: net:docker0 1296 tx 2192 rx
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr crunchstat: net:eth1 996622231 tx 1353980473 rx
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr python: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.15' not found (required by python)
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 stderr python: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.14' not found (required by python)
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 child 27865 on localhost.1 exit 1 success=
4n8aq-8i9sb-i4rk2oysvvc9mik 27772 0 failure (#1, permanent) after 2 seconds
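The popen suggestion above could look roughly like this (a sketch with illustrative variable names): passing the command as an argument list bypasses the shell, so nothing needs escaping.

  # Run git without a shell; each argument is passed through verbatim.
  output = IO.popen(["git", "--git-dir", src_repo,
                     "rev-parse", commit_hash]) { |io| io.read }
  success = $?.success?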
Updated by Peter Amstutz about 10 years ago
Per our discussion on IRC, the virtualenv is being set up based on the host's python rather than the docker container's python, which can fail in a variety of ways due to software mismatches.
Also, it appears that $CRUNCH_INSTALL is not being cleaned up in the "clean work dirs" step, which results in another error when it tries to set up a virtualenv where one already exists.
Updated by Brett Smith about 10 years ago
Peter Amstutz wrote:
- In crunch-job, prefer "source" to "." on line 1825 for readability
I agree completely about readability. Unfortunately, "." is better for portability. "source" is not POSIX and not implemented in Debian's dash.
I implemented all your other suggestions as of 48cec3a. The $CRUNCH_INSTALL issue is rendered moot by moving the install step to inside Docker.
Updated by Peter Amstutz about 10 years ago
Functionally the code looks good. However, I would like to see a little bit more logging in the __DATA__ bootstrap script in crunch-job. Right now it doesn't log anything about setting up the virtualenv unless there was some kind of error. I think at least one line saying "installing SDK version XYZ in virtualenv" would make it less mysterious.
Also, make sure you update the SDK version pin in services/api/Gemfile.
Updated by Brett Smith about 10 years ago
Peter Amstutz wrote:
Functionally the code looks good. However, I would like to see a litte bit more logging in the __DATA__ bootstrap script in crunch-job. Right now it doesn't log anything about setting up the virtualenv unless there was some kind of error. I think at least one line saying "installing SDK version XYZ in virtualenv" would make it less mysterious.
Added in 53a3b79, along with some other clean-ups to make output handling more consistent with previous versions. Because the virtualenv build happens inside Docker, I made it so the logging only happens during task 0, to prevent the same output from appearing over and over again in the logs.
Also, make sure you update the SDK version pin in services/api/Gemfile.
I know, but I believe I have to do this after I merge (but before I push). Jenkins will build gems based on the merge commit, and I've seen gem get grumpy when a Gemfile refers to gems that don't exist (which they won't if I try to refer to my intermediate commits).
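For context, the pin in question would be a Gemfile line along these lines (the gem name and version string here are illustrative):

  gem 'arvados-cli', '>= 0.1.20141202211726'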
Updated by Peter Amstutz about 10 years ago
- As discussed on IRC, it would be helpful if crunch-dispatch could log the version of the arvados-cli gem it is using to run crunch-job, for debugging and reproducibility.
- I got "Warning: Python SDK installed, but can't build virtualenv" which does not tell me what went wrong.
Updated by Brett Smith about 10 years ago
Peter Amstutz wrote:
- As discussed on IRC, it would be helpful if crunch-dispatch could log the version of the arvados-cli gem it is using to run crunch-job, for debugging and reproducibility.
Looking further, I don't think crunch-dispatch can do this, since it has no write access to the job log file. Added some basic diagnostics to crunch-job instead, reporting the full path (since crunch-dispatch accepts any crunch-job script defined in $CRUNCH_JOB_BIN) and available Gem information. It ignores errors for the sake of local development runs.
- I got "Warning: Python SDK installed, but can't build virtualenv" which does not tell me what went wrong.
It means virtualenv can't be run inside the Docker image—probably because it isn't installed, but it could also be caused by a bad $PATH, bad permissions, etc. Clarified the log message. Now at 188a9dd.
Updated by Peter Amstutz about 10 years ago
- It appears that virtualenv gets installed in the "shell" and "compute" Docker images, but not arvados/jobs? That explains why I can't test this successfully: I am using a default arvados/jobs image from a few weeks ago.
- For the log statement bikeshed, I suggest "Warning: virtualenv not found in Docker container default $PATH. Cannot install Python SDK."
- For reporting the gem versions, I get:
/home/peter/work/arvados/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20141202211726, 0.1.20141118163122, 0.1.20141112222031, 0.1.20141017191701, 0.1.20141014201516
Telling me the path it is using is very helpful; telling me the gems that are installed, without knowing which one is actually active, less so - consider taking that out.
Updated by Brett Smith about 10 years ago
Peter Amstutz wrote:
- It appears that virtualenv gets installed in the "shell" and "compute" Docker images, but not arvados/jobs? That explains why I can't test this successfully: I am using a default arvados/jobs image from a few weeks ago.
- For the log statement bikeshed, I suggest "Warning: virtualenv not found in Docker container default $PATH. Cannot install Python SDK."
Both done as of 046dc01.
- For reporting the gem versions, I get:
[...]
Telling me the path it is using is very helpful, telling me the gems that are installed without knowing which one is actually active, less so - consider taking that out.
Fixed by having crunch-dispatch invoke crunch-job with bundle exec, so they see the same Ruby environment, and only one version should normally be listed. This is probably better for consistency anyway, and won't leak through to Docker so we don't have to worry too much about it messing up those jobs. It does affect non-Docker jobs, but we haven't been guaranteeing much about the Ruby setup there either, so I don't think that's an issue.
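A hedged sketch of that invocation change (the surrounding spawn code and argument list are illustrative; the real crunch-dispatch does more environment handling):

  # Prefix crunch-job with `bundle exec` so crunch-dispatch and crunch-job
  # resolve gems from the same Bundler environment.
  cmd = ["bundle", "exec", crunch_job_bin] + job_args  # job_args: illustrative
  pid = Process.spawn(*cmd)
  Process.wait(pid)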
Updated by Brett Smith about 10 years ago
Brett Smith wrote:
Fixed by having crunch-dispatch invoke crunch-job with bundle exec, so they see the same Ruby environment, and only one version should normally be listed. This is probably better for consistency anyway, and won't leak through to Docker so we don't have to worry too much about it messing up those jobs. It does affect non-Docker jobs, but we haven't been guaranteeing much about the Ruby setup there either, so I don't think that's an issue.
I have more proof that this is OK to do, but also found an unrelated bug here. Check out qr1hi-8i9sb-uddf0fqrj36ayui, which ran in production and just prints its environment and tries to list active gems. It clearly shows that Bundler is active in the current environment. I believe this is happening because in production, we have sudo configured to preserve the environment for Crunch. It looks like that probably isn't set up in your development environment, or in Docker. But we're already running with it, so making it explicit with bundle exec is relatively safe, and would just serve to bring development environments and production closer together.
The bug here is that this theoretically messes things up for jobs. The script tried to output the list of arvados-cli gems. That generated no output, because the Bundler environment lived all the way through to the compute nodes, where it no longer exists. Ruby is limiting itself to a nonexistent Gem set. The proper way to fix this would be to have crunch-job clean the environment for non-Docker jobs. I'm not going to tackle that in this branch, because I'm not sure it's important enough to do under crunch-job specifically rather than wait for v2, or whatever.
Updated by Peter Amstutz about 10 years ago
I get "sudo: bundle: command not found" when I try to run a job now. Maybe use "which" to get the right path? However, since we're at the end of the sprint, I'm also okay with backing out this feature entirely.
Everything else LGTM.
Updated by Brett Smith about 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 84 to 100
Applied in changeset arvados|commit:d980949ac4c092a44f3b64fb7cbd4a27a49256fb.