Bug #11209

closed

stuck keep fuse mounts not cleared by crunch-job

Added by Joshua Randall about 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: FUSE
Target version: -
Story points: -

Description

crunch-job attempts to unmount any FUSE filesystems mounted under $CRUNCH_TMP, but it only tries fusermount. On our system this often fails, and a "umount -f <mount_point>" is required to make the node work again.

In addition, this often happens on multiple nodes at the same time, and once three nodes have wedged FUSE mounts, they rapidly fail all pending jobs. There seems to be no mechanism by which crunch dispatch can decide to stop dispatching to a node that is broken.
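To make the requested behaviour concrete, here is a minimal Python sketch of a cleanup step that tries fusermount first and falls back to "umount -f". It is not crunch-job's code. The function names, the injectable `run` parameter, and the assumption that `mount` prints lines of the form "<device> on <mount_point> type <fstype> (<options>)" are all illustrative, and "umount -f" normally needs root.

```python
import subprocess


def fuse_mounts_under(mount_output, prefix, fstypes=("fuse", "fuse.keep")):
    """Return mount points of the given filesystem types that start with prefix.

    Assumes the usual Linux `mount` line format:
        <device> on <mount_point> type <fstype> (<options>)
    Mount points containing spaces are not handled (hypothetical helper).
    """
    points = []
    for line in mount_output.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[1] != "on" or parts[3] != "type":
            continue  # not a line in the expected format
        mount_point, fstype = parts[2], parts[4]
        if fstype in fstypes and mount_point.startswith(prefix):
            points.append(mount_point)
    return points


def unmount_with_fallback(mount_point, run=subprocess.call):
    """Try a lazy fusermount; if that fails, fall back to `umount -f`.

    `run` takes an argv list and returns an exit code. It defaults to
    subprocess.call and can be replaced for testing. Returns the name of
    the method that succeeded, or None if both failed.
    """
    if run(["fusermount", "-u", "-z", mount_point]) == 0:
        return "fusermount"
    if run(["umount", "-f", mount_point]) == 0:  # usually requires root
        return "umount -f"
    return None
```

A caller would feed `fuse_mounts_under` the text printed by `mount` and the value of $CRUNCH_TMP, then call `unmount_with_fallback` on each result and treat a `None` return as a node that needs attention.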

Here is the log from a job that suffered from this issue.

dispatching job z8ta6-8i9sb-8mp2qww92moa644 {"docker_image"=>"mercury/gatk-3.5", "min_nodes"=>1, "max_tasks_per_node"=>10, "keep_cache_mb_per_task"=>1280} to humgen-05-07 z8ta6-7ekkf-sa1q59632vhxov6 {"total_cpu_cores":32,"total_ram_mb":257867,"total_scratch_mb":788561}
2017-02-28_17:23:33 salloc: Granted job allocation 17536
2017-02-28_17:23:33 58397  Sanity check is `/usr/bin/docker ps -q`
2017-02-28_17:23:33 58397  sanity check: start
2017-02-28_17:23:33 58397  stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','/usr/bin/docker','ps','-q']
2017-02-28_17:23:33 58397  sanity check: exit 0
2017-02-28_17:23:33 58397  Sanity check OK
2017-02-28_17:23:33 z8ta6-8i9sb-8mp2qww92moa644 58397  running from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job with arvados-cli Gem version(s) 0.1.20170217221854, 0.1.20161017193526, 0.1.20160503204200, 0.1.20151207150126, 0.1.20151023190001
2017-02-28_17:23:33 z8ta6-8i9sb-8mp2qww92moa644 58397  check slurm allocation
2017-02-28_17:23:33 z8ta6-8i9sb-8mp2qww92moa644 58397  node humgen-05-07 - 10 slots
2017-02-28_17:23:33 z8ta6-8i9sb-8mp2qww92moa644 58397  start
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397  clean work dirs: start
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397  stderr starting: ['srun','--nodelist=humgen-05-07','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397  stderr fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-05-07.10.keep: Invalid argument
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397  stderr srun: error: humgen-05-07: task 0: Exited with exit code 123
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397  clean work dirs: exit 123
2017-02-28_17:23:34 salloc: Relinquishing job allocation 17536
dispatching job z8ta6-8i9sb-8mp2qww92moa644 {"docker_image"=>"mercury/gatk-3.5", "min_nodes"=>1, "max_tasks_per_node"=>10, "keep_cache_mb_per_task"=>1280} to humgen-04-02 z8ta6-7ekkf-ekzlxvozts92sqm {"total_cpu_cores":40,"total_ram_mb":193289,"total_scratch_mb":68302106}
2017-02-28_17:23:35 salloc: error: Unable to allocate resources: Requested nodes are busy
2017-02-28_17:23:35 salloc: Job allocation 17539 has been revoked.
dispatching job z8ta6-8i9sb-8mp2qww92moa644 {"docker_image"=>"mercury/gatk-3.5", "min_nodes"=>1, "max_tasks_per_node"=>10, "keep_cache_mb_per_task"=>1280} to humgen-05-03 z8ta6-7ekkf-1i1v5zotflg26jn {"total_cpu_cores":32,"total_ram_mb":257867,"total_scratch_mb":788561}
2017-02-28_17:23:36 salloc: Granted job allocation 17540
2017-02-28_17:23:36 58715  Sanity check is `/usr/bin/docker ps -q`
2017-02-28_17:23:36 58715  sanity check: start
2017-02-28_17:23:36 58715  stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','/usr/bin/docker','ps','-q']
2017-02-28_17:23:36 58715  sanity check: exit 0
2017-02-28_17:23:36 58715  Sanity check OK
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  running from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job with arvados-cli Gem version(s) 0.1.20170217221854, 0.1.20161017193526, 0.1.20160503204200, 0.1.20151207150126, 0.1.20151023190001
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  check slurm allocation
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  node humgen-05-03 - 10 slots
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  start
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  clean work dirs: start
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  stderr starting: ['srun','--nodelist=humgen-05-03','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  stderr fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-05-03.4.keep: Invalid argument
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  stderr srun: error: humgen-05-03: task 0: Exited with exit code 123
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  clean work dirs: exit 123
2017-02-28_17:23:38 salloc: Relinquishing job allocation 17540
2017-02-28_17:23:38 close failed in file object destructor:
2017-02-28_17:23:38 sys.excepthook is missing
2017-02-28_17:23:38 lost sys.stderr
dispatching job z8ta6-8i9sb-8mp2qww92moa644 {"docker_image"=>"mercury/gatk-3.5", "min_nodes"=>1, "max_tasks_per_node"=>10, "keep_cache_mb_per_task"=>1280} to humgen-04-02 z8ta6-7ekkf-ekzlxvozts92sqm {"total_cpu_cores":40,"total_ram_mb":193289,"total_scratch_mb":68302106}
2017-02-28_17:23:40 salloc: Granted job allocation 17544
2017-02-28_17:23:40 58985  Sanity check is `/usr/bin/docker ps -q`
2017-02-28_17:23:40 58985  sanity check: start
2017-02-28_17:23:40 58985  stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','/usr/bin/docker','ps','-q']
2017-02-28_17:23:40 58985  sanity check: exit 0
2017-02-28_17:23:40 58985  Sanity check OK
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  running from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job with arvados-cli Gem version(s) 0.1.20170217221854, 0.1.20161017193526, 0.1.20160503204200, 0.1.20151207150126, 0.1.20151023190001
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  check slurm allocation
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  node humgen-04-02 - 10 slots
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  start
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  clean work dirs: start
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  stderr starting: ['srun','--nodelist=humgen-04-02','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  stderr fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-04-02.9.keep: Invalid argument
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  stderr srun: error: humgen-04-02: task 0: Exited with exit code 123
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  clean work dirs: exit 123
2017-02-28_17:23:41 salloc: Relinquishing job allocation 17544
2017-02-28_17:23:41 close failed in file object destructor:
2017-02-28_17:23:41 sys.excepthook is missing
2017-02-28_17:23:41 lost sys.stderr
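The log above contains enough information to tell which nodes failed the cleanup step, which is the kind of signal a dispatcher could use to stop sending work to a broken node. The following Python sketch only illustrates that idea and is not part of crunch-dispatch. The function name and regular expressions are hypothetical, and it assumes one node per allocation, with each "node <name> - <n> slots" line followed by a "clean work dirs: exit <code>" line as in this log.

```python
import re
from collections import Counter

# Patterns modelled on the log lines in this report (assumed format).
NODE_RE = re.compile(r"\bnode (\S+) - \d+ slots$")
EXIT_RE = re.compile(r"\bclean work dirs: exit (\d+)$")


def nodes_with_failed_cleanup(log_text):
    """Count, per node, how often 'clean work dirs' exited non-zero."""
    failures = Counter()
    current_node = None
    for raw_line in log_text.splitlines():
        line = raw_line.rstrip()
        node_match = NODE_RE.search(line)
        if node_match:
            # Remember which node the following cleanup result belongs to.
            current_node = node_match.group(1)
            continue
        exit_match = EXIT_RE.search(line)
        if exit_match and current_node is not None:
            if int(exit_match.group(1)) != 0:
                failures[current_node] += 1
            current_node = None
    return failures
```

A dispatcher-side policy could then skip any node whose count exceeds a chosen threshold until an operator clears the stuck mount.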


Subtasks 6 (0 open, 6 closed)

Task #11377: Honor subtype arg | Resolved | Tom Clegg | 03/02/2017
Task #11378: Warn that most users don't want --unmount-all | Resolved | Tom Clegg | 03/02/2017
Task #11292: Review 11209-unmount-replace | Resolved | Lucas Di Pentima | 03/02/2017
Task #11376: Review 11209-unmount-subtype | Resolved | Lucas Di Pentima | 03/02/2017
Task #11353: use arv-mount --unmount-all in crunch-job | Resolved | Tom Clegg | 03/02/2017
Task #11504: review 11209-crunch-unmount-all | Resolved | Lucas Di Pentima | 03/02/2017