Bug #6592
closed
[Crunch] crunch-job should handle cleanup step failures like install step failures
Added by Brett Smith over 9 years ago.
Updated over 9 years ago.
Description
A compute node was running a Keep mount for a non-Crunch user on the system. When work was dispatched to this system, the cleanup step would try to unmount this mount, and fail. This meant that the tempdir cleanup (rm -rf) later in that same step did not run, but Crunch proceeded to run the job, leading to unpredictable results.
Apply the same retry-and-fail strategy that we use for the install step to the cleanup step.
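For illustration only, a bounded retry-then-fail wrapper might look like the sketch below. This ticket does not spell out how the install step retries, so the attempt count and the names retry_step, cleanup_step and job_should_run are all hypothetical, and this is bash rather than crunch-job's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to N times, succeed on the first
# success, and report failure if every attempt fails.
retry_step() {
  local max_attempts=$1; shift
  local attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt of $max_attempts failed: $*" >&2
  done
  return 1
}

# Hypothetical stand-in for a cleanup step that cannot unmount a foreign mount.
cleanup_step() { false; }

# If cleanup never succeeds, do not go on to run the job on this node.
if retry_step 3 cleanup_step; then
  job_should_run=yes
else
  echo "cleanup failed; refusing to run the job here" >&2
  job_should_run=no
fi
```

The point is only that a cleanup failure stops the job from starting, instead of being ignored.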
- Target version changed from 2015-08-19 sprint to 2015-08-05 sprint
- Assigned To set to Tom Clegg
Additional changes in 6592-retry-if-cleanupfail besides the obvious thing:
- Use set -o pipefail for the mount | awk | xargs fusermount command line. I noticed (by stubbing mount with "exit 1") that mount or awk failure was being counted as "nothing to unmount".
- Don't hardcode /usr/bin/docker.io. Hardcoding it makes it hard to mock. Probably makes it harder to integrate with arbitrary docker hosts, too?
- Wrote some crunch-job integration tests, in sdk/cli. They don't get as far as running a successful job but they exercise a few different failure modes near the top, including the one we're creating now. They also prevent us from passing Jenkins with syntax errors.
- Un-skipped a bunch of existing tests that were skipped because they required API server to be running at jenkins time, because we do that now.
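The pipefail point can be reproduced with a small bash sketch. It assumes GNU xargs (for -r), and fake_mount is a hypothetical stand-in for the stubbed mount; the pipeline is a simplified version of the real command line, with echo in place of fusermount.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `mount` stubbed with "exit 1": no output, status 1.
fake_mount() { return 1; }

# Without pipefail the pipeline's status is that of the last command (xargs),
# which succeeds on empty input, so the mount failure looks like
# "nothing to unmount".
( set +o pipefail; fake_mount | awk '{print $3}' | xargs -r echo would-unmount )
status_without=$?

# With pipefail the pipeline returns the rightmost non-zero status, so the
# mount failure is reported to the caller.
( set -o pipefail; fake_mount | awk '{print $3}' | xargs -r echo would-unmount )
status_with=$?

echo "without pipefail: $status_without"   # prints "without pipefail: 0"
echo "with pipefail: $status_with"         # prints "with pipefail: 1"
```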
Updated to install the Perl SDK and check its dependencies in run-tests.sh, instead of loading Arvados.pm from the source tree in the crunch-job tests.
Now at 52c8fd3 (arvados 6592-retry-if-cleanupfail) and arvados-dev|eafd033 (arvados-dev 6592-test-perl).
- Status changed from New to In Progress
CLI tests using Perl pass for me now (yes!), the rest of it looks good to me.
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:6dff0705fd3b4e0acde7bdf5821ef115ba74099b.