Bug #6592

closed

[Crunch] crunch-job should handle cleanup step failures like install step failures

Added by Brett Smith almost 9 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: 1.0

Description

A compute node was running a Keep mount for a non-Crunch user on the system. When work was dispatched to this system, the cleanup step would try to unmount this mount and fail. As a result, the tempdir cleanup (rm -rf) later in that same step did not run, but Crunch still proceeded to run the job, leading to unpredictable results.

Apply the same retry-and-fail strategy that we use for the install step to the cleanup step.
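
The retry-and-fail idea can be sketched roughly as follows. This is a minimal illustration in Python, not crunch-job's actual code: the function names, the attempt count, and the numeric value of EXIT_RETRY_UNLOCKED are all assumptions made for the example. Each setup step is retried a few times, and if it still fails the caller gets a retry status instead of going on to run the job.

```python
from typing import Callable, Iterable

# Hypothetical exit status meaning "give up on this node, let the job be retried".
# The numeric value is a placeholder, not taken from crunch-job.
EXIT_RETRY_UNLOCKED = 93


def run_with_retries(step: Callable[[], bool], max_attempts: int = 3) -> bool:
    """Call step() until it returns True or the attempts run out."""
    for _ in range(max_attempts):
        if step():
            return True
    return False


def prepare_node(steps: Iterable[Callable[[], bool]]) -> int:
    """Run each setup step with retries; return 0 only if all of them succeed."""
    for step in steps:
        if not run_with_retries(step):
            # Never proceed to the job on a node left in an unknown state.
            return EXIT_RETRY_UNLOCKED
    return 0
```

With this shape, a cleanup step that keeps failing is handled the same way as an install step that keeps failing: the job is not started on that node.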


Subtasks 3 (0 open, 3 closed)

Task #6733: Review 6592-retry-if-cleanupfail (Resolved, Peter Amstutz, 07/13/2015)
Task #6748: Exit RETRY_UNLOCKED if cleanup fails (Resolved, Tom Clegg, 07/13/2015)
Task #6738: Test case (Resolved, Tom Clegg, 07/13/2015)
Actions #1

Updated by Brett Smith almost 9 years ago

  • Target version changed from 2015-08-19 sprint to 2015-08-05 sprint
Actions #2

Updated by Tom Clegg almost 9 years ago

  • Story points set to 1.0
Actions #3

Updated by Tom Clegg almost 9 years ago

  • Assigned To set to Tom Clegg
Actions #4

Updated by Tom Clegg over 8 years ago

Additional changes in 6592-retry-if-cleanupfail besides the obvious thing:
  • Use set -o pipefail for the mount | awk | xargs fusermount command line. I noticed (by stubbing mount with "exit 1") that mount or awk failure was being counted as "nothing to unmount".
  • Don't hardcode /usr/bin/docker.io. This makes it hard to mock. Probably makes it harder to integrate with arbitrary docker hosts, too?
  • Wrote some crunch-job integration tests, in sdk/cli. They don't get as far as running a successful job, but they exercise a few different failure modes near the top, including the one we're creating now. They also prevent us from passing Jenkins with syntax errors.
  • Un-skipped a bunch of existing tests that had been skipped because they required the API server to be running at Jenkins time. It is running at that point now.
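
The pipefail point in the first item can be illustrated with a small sketch. It is written in Python rather than shell and is not the crunch-job code: run_pipeline and pipeline_ok are hypothetical helpers, and the stages below are stand-ins for mount, awk and fusermount. The sketch shows why checking only the last stage's exit status makes an early failure look like "nothing to do".

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_pipeline(stages: Sequence[Sequence[str]]) -> Tuple[List[int], bytes]:
    """Run stages as a pipeline; return every stage's exit status and the final output."""
    procs = []
    prev_stdout = subprocess.DEVNULL
    for argv in stages:
        proc = subprocess.Popen(argv, stdin=prev_stdout, stdout=subprocess.PIPE)
        if prev_stdout is not subprocess.DEVNULL:
            prev_stdout.close()  # the child holds its own copy of the pipe
        prev_stdout = proc.stdout
        procs.append(proc)
    output, _ = procs[-1].communicate()
    return [p.wait() for p in procs], output


def pipeline_ok(statuses: Sequence[int]) -> bool:
    """pipefail-style check: every stage must exit 0, not just the last one."""
    return all(status == 0 for status in statuses)


if __name__ == "__main__":
    # Stand-in for a stubbed first stage that does "exit 1".
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    # Stand-in for a later stage that happily accepts empty input.
    consumer = [sys.executable, "-c", "import sys; sys.stdin.read()"]
    statuses, _ = run_pipeline([failing, consumer])
    print("last stage only:", statuses[-1] == 0)
    print("all stages:", pipeline_ok(statuses))
```

Looking only at the last status would report success here, which matches the "nothing to unmount" misreading described above. Checking every stage surfaces the failure.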
Actions #5

Updated by Tom Clegg over 8 years ago

Updated to install the Perl SDK and check its dependencies in run-tests.sh, instead of loading Arvados.pm from the source tree in the crunch-job tests.

Now at 52c8fd3 (arvados 6592-retry-if-cleanupfail) and arvados-dev|eafd033 (arvados-dev 6592-test-perl).

Actions #6

Updated by Peter Amstutz over 8 years ago

  • Status changed from New to In Progress
Actions #7

Updated by Peter Amstutz over 8 years ago

CLI tests using Perl pass for me now (yes!), the rest of it looks good to me.

Actions #8

Updated by Tom Clegg over 8 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:6dff0705fd3b4e0acde7bdf5821ef115ba74099b.
