Bug #8828
Closed
[Crunch] be more resilient when crunchrunner is not available; also don't test for crunchrunner on API server
Description
I noticed two small issues introduced by the changes in #8815. I should have spotted them in review, sorry. Specifically, the diagnostics failed on c97qk with:
2016-03-30_01:08:05 c97qk-8i9sb-etuzc7s3j88iym9 1470 0 stderr Running [docker.io run --name=c97qk-ot0gb-n67uuhx6ng9hzib-0 --attach=stdout --attach=stderr --attach=stdin -i --cidfile=/tmp/crunch-job/c97qk-ot0gb-n67uuhx6ng9hzib-0.cid --sig-proxy --memory=3346971k --memory-swap=3346971k --volume=/tmp/crunch-job/src:/tmp/crunch-job/src:ro --volume=/tmp/crunch-job/opt:/tmp/crunch-job/opt:ro --volume=/tmp/crunch-job/task/compute3.1.keep/by_pdh:/keep:ro --volume=/tmp/crunch-job/task/compute3.1.keep/tmp:/keep_tmp --volume=/tmp --volume=:/usr/local/bin/crunchrunner --volume=/etc/ssl/certs/ca-certificates.crt:/etc/arvados/ca-certificates.crt --env=TASK_KEEPMOUNT_TMP=/keep_tmp --env=CRUNCH_GIT_ARCHIVE_HASH=8e18a89ea517fde50d24eb17b884bc86 --env=CRUNCH_SRC=/tmp/crunch-job/src --env=JOB_UUID=c97qk-8i9sb-etuzc7s3j88iym9 --env=TASK_QSEQUENCE=0 --env=CRUNCH_REFRESH_TRIGGER=/tmp/crunch_refresh_trigger --env=ARVADOS_API_HOST=c97qk.arvadosapi.com --env=TASK_TMPDIR=/tmp/crunch-job-task-work/compute3.1 --env=JOB_WORK=/tmp/crunch-job-work --env=CRUNCH_TMP=/tmp/crunch-job --env=TASK_SLOT_NODE=compute3 --env=CRUNCH_SRC_URL=/var/lib/arvados/internal.git --env=JOB_SCRIPT=hash --env=CRUNCH_WORK=/tmp/crunch-job/work --env=CRUNCH_NODE_SLOTS=1 --env=TASK_SEQUENCE=0 --env=TASK_WORK=/tmp/crunch-job-task-work/compute3.1 --env=JOB_PARAMETER_INPUT=1724fc6b2145c148b894a8da81132ef8+53 --env=ARVADOS_API_TOKEN=42qrvz14riharlxo9qqdighalbu1022iuoyrlj859nbfx8bfyk --env=CRUNCH_JOB_BIN=/usr/local/arvados/src/services/crunch/crunch-job --env=TASK_UUID=c97qk-ot0gb-n67uuhx6ng9hzib --env=TASK_SLOT_NUMBER=1 --env=TASK_KEEPMOUNT=/keep --env=CRUNCH_JOB_UUID=c97qk-8i9sb-etuzc7s3j88iym9 --env=CRUNCH_SRC_COMMIT=4d4c3442e04310d7a88894c105a7cf351fd9f373 --env=CRUNCH_INSTALL=/tmp/crunch-job/opt --env=HOME=/tmp/crunch-job-task-work/compute3.1 f30fae7189adac0948eef3b3386e9ef254f69a8187f9eab99004e2d3650605cd /bin/sh -c python -c "from pkg_resources import get_distribution as get; print \"Using Arvados SDK version\", 
get(\"arvados-python-client\").version">&2 2>/dev/null; mkdir -p "/tmp/crunch-job-work" "/tmp/crunch-job-task-work/compute3.1" && if which stdbuf >/dev/null ; then exec stdbuf --output=0 --error=0 \/tmp\/crunch\-job\/src\/crunch_scripts\/hash ; else exec \/tmp\/crunch\-job\/src\/crunch_scripts\/hash ; fi]
2016-03-30_01:08:05 c97qk-8i9sb-etuzc7s3j88iym9 1470 0 stderr invalid value ":/usr/local/bin/crunchrunner" for flag --volume: bad format for volumes: :/usr/local/bin/crunchrunner
2016-03-30_01:08:05 c97qk-8i9sb-etuzc7s3j88iym9 1470 0 stderr See 'docker.io run --help'.
Two issues here:
a) the 'which crunchrunner' lookup is apparently happening on the API server, not on the compute node: the compute node had crunchrunner installed, but the API server did not. I installed it there, and that fixed the problem. Clearly, the test should happen on the compute node.
b) when 'which crunchrunner' doesn't find the executable, we shouldn't append half a volume statement to the docker run command: the resulting "--volume=:/usr/local/bin/crunchrunner" argument breaks the invocation (and thus fails the job).
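To illustrate point b), here is a minimal Python sketch of the intended behavior. It is not the actual crunch-job code; the function name volume_args and its parameters are hypothetical, and only the two container paths are taken from the log above. The idea is simply that a --volume argument is emitted only when the host-side path is known.

```python
from typing import List, Optional

# Container-side mount points, as they appear in the failing command line.
CRUNCHRUNNER_TARGET = "/usr/local/bin/crunchrunner"
CERTS_TARGET = "/etc/arvados/ca-certificates.crt"


def volume_args(crunchrunner_path: Optional[str],
                certs_path: Optional[str]) -> List[str]:
    """Build docker --volume arguments, skipping any mount whose host path is missing.

    Hypothetical helper: both paths are expected to come from a lookup that
    ran on the compute node (point a), not on the API server.
    """
    args = []
    # An empty or None path means the lookup found nothing: omit the flag
    # entirely instead of emitting "--volume=:/usr/local/bin/crunchrunner".
    if crunchrunner_path:
        args.append(f"--volume={crunchrunner_path}:{CRUNCHRUNNER_TARGET}")
    if certs_path:
        args.append(f"--volume={certs_path}:{CERTS_TARGET}")
    return args
```

On the compute node, the first argument could be supplied by shutil.which("crunchrunner"), which returns None when the executable is not on the PATH, so the flag is dropped rather than left half-built.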
Updated by Peter Amstutz almost 9 years ago
Pushed branch 8828-which-crunchrunner
Updated by Brett Smith almost 9 years ago
- Status changed from New to In Progress
- Assigned To set to Peter Amstutz
- Target version set to 2016-04-13 sprint
Updated by Brett Smith almost 9 years ago
Both the VOLUME_CERTS declarations need to specify that the certs are being mounted at /etc/arvados/ca-certificates.crt. Otherwise this will break the behavior specified in #8815.
With that fix, this is good to merge, thanks.
(I wonder when we're going to hit the limit on the maximum size of a single command line...)
Updated by Peter Amstutz almost 9 years ago
Brett Smith wrote:
Both the VOLUME_CERTS declarations need to specify that the certs are being mounted at /etc/arvados/ca-certificates.crt. Otherwise this will break the behavior specified in #8815.
With that fix, this is good to merge, thanks.
Whoops, thanks for catching that, that's what I get for rushing. Fixed, tested with arvbox, merged, pushed.
(I wonder when we're going to hit the limit on the maximum size of a single command line...)
Updated by Peter Amstutz almost 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:49743c080265b270693154d7a327d0433b0a7dbe.