Bug #5046
closed
Jobs failing to start. Logs show "rm: cannot remove `/tmp/crunch-job/task/compute19.1.keep': Is a directory"
Description
On qr1hi, pipeline instance qr1hi-d1hrv-y4250viwnk8z966 fails quickly after being registered to start. From the logs, here are some of the first errors it looks to have encountered:
2015-01-21_23:40:34 starting: ['srun','--nodelist=compute4,compute16,compute18,compute19,compute28,compute29,compute43,compute48','-D','/tmp','bash','-c','if mount | grep -q $JOB_WORK/; then for i in $JOB_WORK/*keep $CRUNCH_TMP/task/*.keep; do /bin/fusermount -z -u $i; done; fi; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src*']
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute4.1.keep': Is a directory
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute28.1.keep': Is a directory
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute43.1.keep': Is a directory
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute18.1.keep': Is a directory
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute29.1.keep': Is a directory
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute48.1.keep': Is a directory
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute16.1.keep': Is a directory
2015-01-21_23:40:35 rm: cannot remove `/tmp/crunch-job/task/compute19.1.keep': Is a directory
2015-01-21_23:40:35 srun: error: compute4: task 0: Exited with exit code 1
Log file manifest is 97af21f7764d3eafa33b58bf186879d9+85.
Updated by Brett Smith over 9 years ago
This is a duplicate of #4967. (The symptoms are more like #4970, but they have the same root cause.) We already have a fix pushed to git master and are working on deploying it, but unrelated obstacles are holding that up. I'll let you know when it's done.
In the meantime, this issue arises when a job is canceled and has to be ended forcefully. After that happens, the compute node is messed up for following jobs. But as long as the coast is clear when you start your job, you shouldn't run into this. If you let me know when you're ready to run it, I can check the compute nodes and make sure they're clean and give you the green light to start, so you can at least make progress without waiting on our deploy.
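A minimal sketch of what such a check might look like on one node, assuming a leftover mount shows up as a `*.keep` directory under `/tmp/crunch-job/task` as in the log above. The function name `list_leftover_keep_dirs` is hypothetical, and this is not necessarily the procedure used on qr1hi:

```shell
#!/bin/sh
# Hypothetical helper (not part of crunch-job): list leftover *.keep
# directories under a task directory and report whether each one still
# appears as a mount point.
list_leftover_keep_dirs() {
    task_dir=$1
    for d in "$task_dir"/*.keep; do
        # When nothing matches, the shell leaves the pattern unexpanded; skip it.
        if [ "$d" = "$task_dir/*.keep" ]; then
            continue
        fi
        # /proc/mounts has one mount per line: device, mount point, type, ...
        # -s keeps grep quiet on systems without /proc/mounts (assumed Linux).
        if grep -qsF " $d " /proc/mounts; then
            echo "mounted $d"
        else
            echo "unmounted $d"
        fi
    done
}

# Example: list_leftover_keep_dirs /tmp/crunch-job/task
```

No output would suggest the node has no leftover task directories; any line printed names a directory that a later job's cleanup step may trip over.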
Updated by Brett Smith over 9 years ago
- Status changed from New to Closed
- Target version deleted (Bug Triage)
qr1hi is deployed with the bugfix.