Bug #7227

[Crunch] Fails when other users have Keep mounts on the system

Added by Brett Smith over 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: Crunch
Target version: 2015-09-30 sprint
Story points: 0.5

Description

During its initial cleanup step, crunch-job finds all Keep mounts on the system and tries to unmount them. The code assumes that every Keep mount on the system belongs to a past crunch-job process on the compute node. However, this is not always true:

  • On compute nodes, administrators sometimes create a Keep mount for debugging purposes.
  • On shell nodes, when someone runs crunch-job in local mode, other users on the system may have Keep mounts (and the user's own Keep mount shouldn't be unmounted anyway).

Tighten the criteria for which mounts to unmount to prevent these issues.
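
For context, the cleanup amounts to something like this (a minimal sketch, not the verbatim crunch-job code): every FUSE Keep mount on the host gets lazily unmounted, regardless of who owns it or where it lives.

    # Field 3 of `mount` output is the mount point; every fuse.keep mount
    # on the host gets lazily unmounted.
    mount -t fuse,fuse.keep | awk '{print $3}' | xargs -r -n 1 fusermount -u -z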


Subtasks 1 (0 open, 1 closed)

Task #7376: Review 7227-crunch-job-stricter-unmount-wip (Resolved, Brett Smith, 09/23/2015)

Related issues 1 (0 open, 1 closed)

Has duplicate: Arvados - Task #7592: arv crunch job causes weird exit (Closed, 10/16/2015)
#1

Updated by Brett Smith over 9 years ago

This bug was noticed by a user who was trying to follow the tutorial and run Crunch locally. After you fix this bug, please test local job running and see whether it works. If other bugs still prevent it, we should probably discuss how the docs should present local job running.

#2

Updated by Brett Smith over 9 years ago

I think it should be sufficient to extend the awk regular expression so it only finds Keep mounts under $ENV{CRUNCH_TMP}.
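
A hypothetical sketch of that approach, interpolating $CRUNCH_TMP into the pattern (any regex metacharacters in the path would need escaping):

    # Only unmount FUSE mounts whose mount point begins with $CRUNCH_TMP.
    # Interpolating the path into a regex is fragile if it contains
    # metacharacters; see note-8 for the approach actually taken.
    mount -t fuse,fuse.keep | awk "(\$3 ~ \"^$CRUNCH_TMP\"){print \$3}" \
      | xargs -r -n 1 fusermount -u -z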

#3

Updated by Brett Smith over 9 years ago

#4967 intentionally adopted a broader unmount strategy to avoid a bug that appeared when we moved the mount from $JOB_WORK to $TASK_WORK. I still think unmounting everything under $ENV{CRUNCH_TMP} is the right strategy: #4967 clearly caused a feature regression that makes it difficult to run crunch-job locally in some circumstances; unmounting everything under $ENV{CRUNCH_TMP} is still broader than the pre-#4967 behavior; and it's hard to imagine crunch-job making a mount outside that root.

#4

Updated by Brett Smith over 9 years ago

  • Target version changed from Arvados Future Sprints to 2015-09-30 sprint
#5

Updated by Brett Smith over 9 years ago

  • Target version changed from 2015-09-30 sprint to Arvados Future Sprints
#6

Updated by Brett Smith over 9 years ago

  • Story points set to 0.5

0.5 points to implement the awk fix.

You can test this by running in local mode as described in note-1. This unmount happens very early in the script, so as long as you can run your version on a shell node and don't see it unmount other users' FUSE directories, that's good.
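
A hypothetical spot check on a shell node (the file names here are just examples):

    # Record FUSE mount points before and after a local run; any
    # difference should be confined to paths under $CRUNCH_TMP.
    mount -t fuse,fuse.keep | awk '{print $3}' > mounts-before.txt
    # ... run crunch-job in local mode as described in note-1 ...
    mount -t fuse,fuse.keep | awk '{print $3}' > mounts-after.txt
    diff mounts-before.txt mounts-after.txt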

#7

Updated by Brett Smith about 9 years ago

  • Assigned To set to Brett Smith
#8

Updated by Brett Smith about 9 years ago

  • Target version changed from Arvados Future Sprints to 2015-09-30 sprint

7227-crunch-job-stricter-unmount-wip is up for review. It changes the awk expression to return only mount paths that begin with $CRUNCH_TMP. It does this with an index check, because that seemed easier and safer than trying to figure out how to escape regular-expression characters in $CRUNCH_TMP for awk through bash and Perl.
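
The shape of the check is roughly this (a sketch of the approach, not the committed code):

    # index() does a literal substring search, so $CRUNCH_TMP needs no
    # regex escaping; a return value of 1 means the mount point starts
    # with $CRUNCH_TMP.
    mount -t fuse,fuse.keep \
      | awk -v tmp="$CRUNCH_TMP" '(index($3, tmp) == 1){print $3}' \
      | xargs -r -n 1 fusermount -u -z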

I tested this change by trying to run a script locally on a shell node where other users had Keep mounts. The cleanup step succeeded. If I manually mounted Keep under $CRUNCH_TMP beforehand, the cleanup step correctly unmounted it.

The local run still failed because the cluster defines a default Docker image and I did not have permission to run Docker on the shell node. I'm not sure whether that should be fixed at the crunch-job level or whether it's a policy/deployment issue. It definitely hampers walking through the tutorial on our normal clusters, but it can be dealt with separately.

#9

Updated by Brett Smith about 9 years ago

  • Status changed from New to In Progress
#10

Updated by Peter Amstutz about 9 years ago

Looks good to me.

#11

Updated by Brett Smith about 9 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:73ce0cf7675e060d33e75488edfa4f533c177f82.
