Bug #7227 (Closed): [Crunch] Fails when other users have Keep mounts on the system
Description
At the beginning of each run, during the cleanup step, crunch-job finds all Keep mounts on the system and tries to unmount them. The code assumes that every Keep mount belongs to a past crunch-job process on the compute node. However, this is not always true:
- On compute nodes, administrators sometimes create a Keep mount for debugging purposes.
- On shell nodes, when someone is using crunch-job in local run mode, other users on the system may have Keep mounts (and the user's own Keep mount shouldn't be unmounted anyway).
Tighten the criteria for which mounts get unmounted so these mounts are left alone.
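For reference, the failing cleanup amounts to something like the following (a minimal sketch, not the actual crunch-job code; it assumes mount(8) output of the form "src on /path type fstype (opts)", so the mount point is field 3):

    # Lazily unmount anything that looks like a Keep mount, no matter whose it is.
    mount | awk '$3 ~ /keep/ { print $3 }' | xargs -r -n 1 fusermount -u -z

The problem is the unanchored match: it picks up every Keep mount on the system, including mounts that belong to administrators or other users.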
Updated by Brett Smith over 9 years ago
This bug was noticed by a user who was trying to follow the tutorial and run Crunch locally. After you fix this bug, please test local job running and see whether it works. If other bugs still prevent it, we should probably discuss how the documentation should handle local job running.
Updated by Brett Smith over 9 years ago
I think it should be sufficient to extend the awk regular expression to only find Keep mounts under $ENV{CRUNCH_TMP}.
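Something along these lines, continuing the sketch above (again assuming the mount point is field 3, and that $CRUNCH_TMP contains no regular-expression metacharacters; that caveat comes up again below):

    # Only unmount Keep mounts whose path starts with the crunch temp directory.
    mount | awk -v tmp="$CRUNCH_TMP" '$3 ~ ("^" tmp) { print $3 }' \
      | xargs -r -n 1 fusermount -u -z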
Updated by Brett Smith over 9 years ago
#4967 intentionally adopted a broader unmount strategy to avoid a bug that appeared when we moved the mount from $JOB_WORK to $TASK_WORK. I still think unmounting everything under $ENV{CRUNCH_TMP} is the right strategy, though: #4967 clearly caused a feature regression, making it difficult to run crunch-job locally under some circumstances; unmounting everything under $ENV{CRUNCH_TMP} is still broader than what we had pre-#4967; and it's hard to imagine crunch-job ever creating a mount outside that root.
Updated by Brett Smith over 9 years ago
- Target version changed from Arvados Future Sprints to 2015-09-30 sprint
Updated by Brett Smith over 9 years ago
- Target version changed from 2015-09-30 sprint to Arvados Future Sprints
Updated by Brett Smith over 9 years ago
- Story points set to 0.5
0.5 points to implement the awk fix.
You can test this by running in local mode as described in note-1. The unmount happens very early in the script, so it's enough to run your version on a shell node and confirm that it doesn't unmount other users' FUSE directories.
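One way to check that (a hypothetical verification recipe, not from the branch):

    mount | grep fuse >/tmp/mounts.before     # snapshot all FUSE mounts on the node
    # ... run your crunch-job branch in local mode ...
    mount | grep fuse >/tmp/mounts.after
    diff /tmp/mounts.before /tmp/mounts.after # other users' mounts should be unchanged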
Updated by Brett Smith about 9 years ago
- Target version changed from Arvados Future Sprints to 2015-09-30 sprint
7227-crunch-job-stricter-unmount-wip is up for review. It changes the awk expression to return only mount paths that begin with $CRUNCH_TMP. It does this with an index() check, because that seemed easier and safer than working out how to escape regular-expression metacharacters in $CRUNCH_TMP for awk, through both bash and Perl quoting.
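The index() version of the filter looks roughly like this (a sketch of the technique, not the literal branch code). awk's index(s, t) does a plain substring search and returns the 1-based position of t in s, so a result of 1 means "starts with", and nothing in $CRUNCH_TMP needs escaping:

    # Only unmount mount points that start with $CRUNCH_TMP.
    mount | awk -v tmp="$CRUNCH_TMP" 'index($3, tmp) == 1 { print $3 }' \
      | xargs -r -n 1 fusermount -u -z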
I tested this change by trying to run a script locally on a shell node where other users had Keep mounts. The cleanup step succeeded. If I manually mounted Keep under $CRUNCH_TMP beforehand, the cleanup step correctly unmounted it.
The local run still failed, because the cluster defines a default Docker image and I did not have permission to run Docker on the shell node. I'm not sure whether that should be fixed at the crunch-job level, or whether it's a policy/deployment issue. It definitely puts a crimp in walking through the tutorial on our normal clusters, but that can be dealt with separately.
Updated by Brett Smith about 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:73ce0cf7675e060d33e75488edfa4f533c177f82.