Bug #5352 (Closed)
[Crunch] Dispatcher not handling node allocation failures
Added by Abram Connelly over 9 years ago. Updated over 9 years ago.
Description
Pipeline instance su92l-d1hrv-wqa83p9taz3tvmf failed on the second job of its two-job pipeline.
From the logs, there is an error that says:
2015-03-02_04:41:12 {"errors":["State invalid change from Failed to Complete"],"error_token":"1425271270+a34d19c7"} at /usr/local/arvados/src/sdk/perl/lib/Arvados/ResourceProxy.pm line 28
Though before it there are errors of the form:
2015-03-02_04:39:57 salloc: error: Unable to allocate resources: Requested nodes are busy
2015-03-02_04:40:37 salloc: Granted job allocation 272
2015-03-02_04:40:37 salloc: error: Unable to allocate resources: Requested nodes are busy
2015-03-02_04:40:38 salloc: error: Unable to allocate resources: Requested nodes are busy
Rerunning the pipeline/jobs completes successfully, so this issue appears to be transient.
Updated by Abram Connelly over 9 years ago
For some jobs that I run, the state in Workbench shows 'Failed' even while the job/pipeline is still running and producing output in the 'Log' pane. I have a job that creates two tasks, and they appear to start running even after the API server has, as far as I can tell, already marked the job/pipeline as failed.
Running the same pipeline works if I retry enough times. Some of the failing instances are:
tb05z-d1hrv-doz5n3xtehdyatq
tb05z-d1hrv-q56oz103wv3mxzm
tb05z-d1hrv-hnq3pd7s1sihm6b
This may or may not be related but from the logs, I see errors of the form:
2015-03-03_22:43:11 32688 API method jobs/lock failed: API call /jobs/tb05z-8i9sb-sybu9drx4wqd07b/lock failed: 403 Forbidden {"errors":["#<ArvadosModel::AlreadyLockedError: ArvadosModel::AlreadyLockedError>"],"error_token":"1425422591+3de80ee9"}. Retrying at 2015-03-03T22:43:15Z.
2015-03-03_22:43:15 32688 Error while locking job, exiting 75
2015-03-03_22:43:15 salloc: Relinquishing job allocation 208
I see what look like similar errors even in runs of the same job/pipeline that complete successfully.
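The "Error while locking job, exiting 75" line above is crunch-job giving up after failing to lock the job through the API server's jobs/lock call. Conceptually, the handshake looks like the following minimal Ruby sketch; it is not the actual crunch-job code (which is Perl and uses the Arvados SDK), and the host, token, and URL layout here are illustrative assumptions:

# Illustrative sketch of the job-locking handshake -- not the real crunch-job code.
require 'net/http'
require 'uri'

EX_TEMPFAIL = 75  # sysexits.h "temporary failure", as seen in the log above

def lock_job_or_exit(api_host, api_token, job_uuid)
  uri = URI("https://#{api_host}/arvados/v1/jobs/#{job_uuid}/lock")
  req = Net::HTTP::Post.new(uri)
  req['Authorization'] = "OAuth2 #{api_token}"
  res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    http.request(req)
  end
  return if res.is_a?(Net::HTTPSuccess)
  # Another crunch-job instance already holds the lock (403 AlreadyLockedError).
  # Exit with EX_TEMPFAIL so the dispatcher knows this process never ran the job.
  warn "Error while locking job, exiting #{EX_TEMPFAIL}"
  exit EX_TEMPFAIL
end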
Updated by Abram Connelly over 9 years ago
It's hard to be sure, but I think this issue is triggered when a job/pipeline fails for some reason: the next job run after it is automatically listed as 'Failed' even though it appears to be running. I've been running pipeline instances that invoke the same jobs and seeing this issue crop up, so perhaps the fact that the pipelines are nearly identical, differing only by a Git update or an input parameter change, has something to do with it.
For example, on tb05z I ran pipeline instance tb05z-d1hrv-s0ri2l1ow825ioc, which failed because of bugs in my program. I then re-ran it, I believe changing only the inputs and not any of the repositories used by the jobs in the pipeline, and the new pipeline instance tb05z-d1hrv-5rvli36mtc5twzf shows 'Failed' even though its real-time logs in Workbench show it running. As of this writing, the 'failed' pipeline instance 5rvli... looks to be running correctly.
Updated by Abram Connelly over 9 years ago
Doing a dummy commit ('dirtying' the Git repo) and re-running the pipeline appears to work around the issue.
Updated by Peter Amstutz over 9 years ago
su92l is running multiple crunch-dispatch --job instances. When there is a race condition and multiple crunch-job processes get started for the same job, each is supposed to try to lock the job, and the losers that can't run it exit with return code 75. What appears to be happening is that each dispatcher asks SLURM which nodes are free, and each tries to allocate the same node. When they run salloc, one of the dispatchers wins, but the other two fail node allocation with "salloc: error: Unable to allocate resources: Requested nodes are busy". The bug is that crunch-dispatch does not detect the salloc failure and instead applies the logic for "job exited without setting job.state to Completed or Failed", which causes the job to be marked "Failed".
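In other words, the dispatcher needs to distinguish "the allocation or the job lock was lost to another dispatcher, so leave the job queued" from "crunch-job really ran and exited without finalizing the job". A minimal Ruby sketch of that decision, assuming hypothetical leave_job_queued/mark_job_failed helpers and the exit/stderr conventions described above; this is not the actual crunch-dispatch code:

# Illustrative sketch only -- not the actual crunch-dispatch implementation.
# `status` is the Process::Status of a reaped crunch-job child started via
# salloc, and `stderr_tail` is the last stderr output captured from it.

EX_TEMPFAIL = 75  # "temporary failure" exit code used by the losing crunch-job

def handle_child_exit(job, status, stderr_tail)
  if status.exitstatus == EX_TEMPFAIL ||
     stderr_tail.include?('salloc: error: Unable to allocate resources')
    # The node allocation (or the job lock) went to another dispatcher.
    # Leave the job queued so some dispatcher can pick it up again later,
    # instead of treating this as a crashed run.
    leave_job_queued(job)   # hypothetical helper
  elsif !%w(Complete Failed).include?(job.state)
    # crunch-job really ran and exited without setting job.state,
    # so the job is marked Failed.
    mark_job_failed(job)    # hypothetical helper
  end
end

The key point is that a lost allocation race is not evidence that the job itself ran and crashed.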
Updated by Peter Amstutz over 9 years ago
- Subject changed from Transient pipeline failure to [Crunch] Dispatcher not handling node allocation failures
Updated by Tom Clegg over 9 years ago
- Target version changed from Bug Triage to 2015-04-29 sprint
Updated by Tom Clegg over 9 years ago
- Target version changed from 2015-04-29 sprint to Arvados Future Sprints
Updated by Tom Clegg over 9 years ago
- Target version changed from Arvados Future Sprints to 2015-04-29 sprint
Updated by Brett Smith over 9 years ago
5352-crunch-dispatch-salloc-tempfail-wip is up for review. Please see the commit message for explanation.
Updated by Brett Smith over 9 years ago
By running this patch in Docker, I successfully reproduced the race condition and saw the changes complete the job. Here are the logs from two simultaneous crunch-dispatch processes:
dispatch: ["git", "--git-dir=/var/lib/arvados/internal.git", "tag", "bcsdk-8i9sb-p7lp5shen37qrib", "eb5da02dd7bb5552bd0594e46d081567d485ad5e"] dispatch: sudo -E -u crunch PATH=/usr/local/rvm/gems/ruby-2.1.5/bin:/usr/local/rvm/gems/ruby-2.1.5@global/bin:/usr/local/rvm/rubies/ruby-2.1.5/bin:/usr/local/rvm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/src/arvados/services/crunch PERLLIB=/usr/src/arvados/sdk/perl/lib PYTHONPATH= RUBYLIB=/usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.9.2/lib GEM_PATH=/usr/local/rvm/gems/ruby-2.1.5:/usr/local/rvm/gems/ruby-2.1.5@global salloc --chdir=/ --immediate --exclusive --no-kill --job-name=bcsdk-8i9sb-p7lp5shen37qrib --nodelist=compute0 /usr/src/arvados/services/crunch/crunch-job --job-api-token 26mw2gzbcaio1boim5pzb7qn6s3um2r6zuyvyo6jviqbrkhh2v --job bcsdk-8i9sb-p7lp5shen37qrib --git-dir /var/lib/arvados/internal.git dispatch: job bcsdk-8i9sb-p7lp5shen37qrib dispatch: child 30044 start 2015-04-06_14:04:45 dispatch: update compute0 state to {:state=>"alloc", :job=>"bcsdk-8i9sb-p7lp5shen37qrib"} dispatch: ["git", "--git-dir=/var/lib/arvados/internal.git", "tag", "bcsdk-8i9sb-p7lp5shen37qrib", "eb5da02dd7bb5552bd0594e46d081567d485ad5e"] bcsdk-8i9sb-p7lp5shen37qrib ! salloc: Granted job allocation 29 dispatch: sudo -E -u crunch PATH=/usr/local/rvm/gems/ruby-2.1.5/bin:/usr/local/rvm/gems/ruby-2.1.5@global/bin:/usr/local/rvm/rubies/ruby-2.1.5/bin:/usr/local/rvm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/src/arvados/services/crunch PERLLIB=/usr/src/arvados/sdk/perl/lib PYTHONPATH= RUBYLIB=/usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.9.2/lib GEM_PATH=/usr/local/rvm/gems/ruby-2.1.5:/usr/local/rvm/gems/ruby-2.1.5@global salloc --chdir=/ --immediate --exclusive --no-kill --job-name=bcsdk-8i9sb-p7lp5shen37qrib --nodelist=compute0 /usr/src/arvados/services/crunch/crunch-job --job-api-token p7945utn6byf0ioqefrivx55r8ailzc68hsifskhisaq43ssp --job bcsdk-8i9sb-p7lp5shen37qrib --git-dir /var/lib/arvados/internal.git dispatch: job bcsdk-8i9sb-p7lp5shen37qrib dispatch: child 30062 start 2015-04-06_14:04:45 dispatch: update compute0 state to {:state=>"alloc", :job=>"bcsdk-8i9sb-p7lp5shen37qrib"} dispatch: child 30062 exit dispatch: job bcsdk-8i9sb-p7lp5shen37qrib end bcsdk-8i9sb-p7lp5shen37qrib ! salloc: error: Unable to allocate resources: Requested nodes are busy bcsdk-8i9sb-p7lp5shen37qrib 30053 running from /usr/src/arvados/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20150323181012, 0.1.20150205181653 bcsdk-8i9sb-p7lp5shen37qrib 30053 check slurm allocation bcsdk-8i9sb-p7lp5shen37qrib 30053 node compute0 - 1 slots bcsdk-8i9sb-p7lp5shen37qrib 30053 start bcsdk-8i9sb-p7lp5shen37qrib 30053 Clean work dirs bcsdk-8i9sb-p7lp5shen37qrib ! starting: ['srun','--nodelist=compute0','-D','/tmp','bash','-ec','mount -t fuse,fuse.keep | awk \'($3 ~ /\\ykeep\\y/){print $3}\' | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid'] bcsdk-8i9sb-p7lp5shen37qrib 30053 Cleanup command exited 0 bcsdk-8i9sb-p7lp5shen37qrib ! starting: ['srun','--nodelist=compute0','/bin/sh','-ec',' if ! 
/usr/bin/docker.io images -q --no-trunc --all | grep -qxF ae2af9352ae54e27690ab8645e85e501f7ba78a66b20bd7abc2ccbcced74db06; then arv-get 5814aaff04a7177493c207d16689211f\\+966\\/ae2af9352ae54e27690ab8645e85e501f7ba78a66b20bd7abc2ccbcced74db06\\.tar | /usr/bin/docker.io load fi '] bcsdk-8i9sb-p7lp5shen37qrib 30053 Looking for version eb5da02dd7bb5552bd0594e46d081567d485ad5e from repository brett/brett bcsdk-8i9sb-p7lp5shen37qrib 30053 Using local repository '/var/lib/arvados/internal.git' bcsdk-8i9sb-p7lp5shen37qrib 30053 Version eb5da02dd7bb5552bd0594e46d081567d485ad5e is commit eb5da02dd7bb5552bd0594e46d081567d485ad5e bcsdk-8i9sb-p7lp5shen37qrib 30053 Run install script on all workers starting: ['srun','--nodelist=compute0','-D','/tmp','--job-name=bcsdk-8i9sb-p7lp5shen37qrib','sh','-c','mkdir -p /tmp/crunch-job/opt && cd /tmp/crunch-job && perl -'] bcsdk-8i9sb-p7lp5shen37qrib 30053 Install script exited 0 bcsdk-8i9sb-p7lp5shen37qrib 30053 script 4027test.py bcsdk-8i9sb-p7lp5shen37qrib 30053 script_version eb5da02dd7bb5552bd0594e46d081567d485ad5e bcsdk-8i9sb-p7lp5shen37qrib 30053 script_parameters {} bcsdk-8i9sb-p7lp5shen37qrib 30053 runtime_constraints {"max_tasks_per_node":0,"docker_image":"arvados/jobs"} bcsdk-8i9sb-p7lp5shen37qrib 30053 start level 0 bcsdk-8i9sb-p7lp5shen37qrib 30053 status: 0 done, 0 running, 1 todo bcsdk-8i9sb-p7lp5shen37qrib 30053 0 job_task bcsdk-ot0gb-oqafub6fvjy29pz bcsdk-8i9sb-p7lp5shen37qrib 30053 0 child 30502 started on compute0.1 [elide normal job log] bcsdk-8i9sb-p7lp5shen37qrib 30053 1 child 30556 on compute0.1 exit 0 success=true bcsdk-8i9sb-p7lp5shen37qrib 30053 1 success in 2 seconds bcsdk-8i9sb-p7lp5shen37qrib 30053 1 task output (87 bytes): [locator] bcsdk-8i9sb-p7lp5shen37qrib 30053 wait for last 0 children to finish bcsdk-8i9sb-p7lp5shen37qrib 30053 status: 2 done, 0 running, 0 todo bcsdk-8i9sb-p7lp5shen37qrib 30053 release job allocation bcsdk-8i9sb-p7lp5shen37qrib ! salloc: Job allocation 29 has been revoked. bcsdk-8i9sb-p7lp5shen37qrib 30053 Freeze not implemented bcsdk-8i9sb-p7lp5shen37qrib 30053 collate dispatch: update compute0 state to {:state=>"idle", :job=>nil} bcsdk-8i9sb-p7lp5shen37qrib 30053 collated output manifest text to send to API server is 224 bytes with access tokens bcsdk-8i9sb-p7lp5shen37qrib 30053 job output [locator] dispatch: update compute0 state to {:state=>"idle", :job=>nil} bcsdk-8i9sb-p7lp5shen37qrib 30053 finish Collection saved as 'bcsdk-8i9sb-p7lp5shen37qrib.log.txt' bcsdk-8i9sb-p7lp5shen37qrib 30053 log manifest is [locator] dispatch: child 30044 exit dispatch: job bcsdk-8i9sb-p7lp5shen37qrib end
After this, the job is marked successful in the API server.
Updated by Peter Amstutz over 9 years ago
It's convenient that SLURM has a feature to help with exactly this situation.
a9c5882 LGTM
Updated by Brett Smith over 9 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:92052dc0b2f80a16420483a756fcebedc6c6ec3c.