Bug #5046

Jobs failing to start. Logs show "rm: cannot remove `/tmp/crunch-job/task/compute19.1.keep': Is a directory"

Added by Abram Connelly over 5 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date: 01/21/2015
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

On qr1hi, pipeline instance qr1hi-d1hrv-y4250viwnk8z966 fails shortly after being registered to start. Here are the first errors that appear in its logs:

2015-01-21_23:40:34 starting: ['srun','--nodelist=compute4,compute16,compute18,compute19,compute28,compute29,compute43,compute48','-D','/tmp','bash','-c','if mount | grep -q $JOB_WORK/; then for i in $JOB_WORK/*keep $CRUNCH_TMP/task/*.keep; do /bin/fusermount -z -u $i; done; fi; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src*']
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute4.1.keep': Is a directory
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute28.1.keep': Is a directory
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute43.1.keep': Is a directory
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute18.1.keep': Is a directory
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute29.1.keep': Is a directory
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute48.1.keep': Is a directory
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute16.1.keep': Is a directory
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute19.1.keep': Is a directory
2015-01-21_23:40:35 srun: error: compute4: task 0: Exited with exit code 1

Log file manifest is 97af21f7764d3eafa33b58bf186879d9+85.


Related issues

Is duplicate of Arvados - Bug #4967: [Crunch] Doesn't cope well with FUSE mounts left hanging around after killing tasks with SIGKILL (Resolved, 01/21/2015)

History

#1 Updated by Brett Smith over 5 years ago

This is a duplicate of #4967. (The symptoms are more like #4970, but they have the same root cause.) We already have a fix pushed to git master and are working on deploying it, but unrelated obstacles are holding that up. I'll let you know when it's done.

In the meantime, this issue arises when a job is canceled and has to be ended forcefully. After that happens, the compute node is left in a bad state for the jobs that follow. But as long as the coast is clear when you start your job, you shouldn't run into this. If you let me know when you're ready to run it, I can check the compute nodes, make sure they're clean, and give you the green light to start, so you can at least make progress without waiting on our deploy.
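
For anyone who wants to look at a node themselves, a check for leftover mounts could be sketched roughly as below. This is not the check actually run on qr1hi. The function name `stale_keep_mounts` is hypothetical, and the sketch assumes that leftover mounts appear in a Linux `/proc/mounts`-style table with a filesystem type beginning with `fuse`, under the `/tmp/crunch-job/task` path seen in the logs.

```python
def stale_keep_mounts(mounts_text, task_dir="/tmp/crunch-job/task"):
    """Return FUSE mount points found under task_dir.

    mounts_text is the content of a /proc/mounts-style table, where each
    line reads: device mount_point fs_type options dump pass.
    Assumption: leftover task mounts have an fs_type starting with "fuse".
    """
    prefix = task_dir.rstrip("/") + "/"
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        if mount_point.startswith(prefix) and fs_type.startswith("fuse"):
            found.append(mount_point)
    return found
```

On a compute node, something like `stale_keep_mounts(open("/proc/mounts").read())` returning an empty list would suggest that nothing is still mounted under the task directory. It says nothing about other leftover state from a killed job.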

#2 Updated by Brett Smith over 5 years ago

  • Status changed from New to Closed
  • Target version deleted (Bug Triage)

qr1hi is deployed with the bugfix.
