Bug #5352 (closed): [Crunch] Dispatcher not handling node allocation failures

Added by Abram Connelly about 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: -
Target version: 2015-04-29 sprint
Story points: 0.5

Description

Pipeline instance su92l-d1hrv-wqa83p9taz3tvmf failed on the second job of its two-job pipeline.

From the logs, there is an error that says:

2015-03-02_04:41:12 {"errors":["State invalid change from Failed to Complete"],"error_token":"1425271270+a34d19c7"} at /usr/local/arvados/src/sdk/perl/lib/Arvados/ResourceProxy.pm line 28

Before that, there are errors of the form:

2015-03-02_04:39:57 salloc: error: Unable to allocate resources: Requested nodes are busy
2015-03-02_04:40:37 salloc: Granted job allocation 272
2015-03-02_04:40:37 salloc: error: Unable to allocate resources: Requested nodes are busy
2015-03-02_04:40:38 salloc: error: Unable to allocate resources: Requested nodes are busy

Rerunning the pipeline/jobs completes successfully, so this issue looks to be transient.


Subtasks (1): 0 open, 1 closed

Task #5670: Review 5352-crunch-dispatch-salloc-tempfail-wip (Resolved, Peter Amstutz, 04/05/2015)
#1

Updated by Abram Connelly about 9 years ago

For some jobs that I run, Workbench shows the state as 'failed' even while the job/pipeline is still running and producing output in the 'Log' pane. I have a job that creates two tasks, and it looks like they start to run even after the API server has, as far as I can tell, already decided the job/pipeline failed.

Running the same pipeline works if I retry enough times. Some of the failing instances are:

tb05z-d1hrv-doz5n3xtehdyatq
tb05z-d1hrv-q56oz103wv3mxzm
tb05z-d1hrv-hnq3pd7s1sihm6b

This may or may not be related, but in the logs I see errors of the form:

2015-03-03_22:43:11 32688 API method jobs/lock failed: API call /jobs/tb05z-8i9sb-sybu9drx4wqd07b/lock failed: 403 Forbidden {"errors":["#<ArvadosModel::AlreadyLockedError: ArvadosModel::AlreadyLockedError>"],"error_token":"1425422591+3de80ee9"}. Retrying at 2015-03-03T22:43:15Z.
2015-03-03_22:43:15 32688 Error while locking job, exiting 75
2015-03-03_22:43:15 salloc: Relinquishing job allocation 208

I see what look like similar errors even for runs of the same job/pipeline that complete successfully.
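
(An aside for context, not part of the original comment: the "exiting 75" above is EX_TEMPFAIL from sysexits.h. Per the log, crunch-job retries the jobs/lock API call and, if the job stays locked by someone else, exits 75 so the dispatcher can treat "another process is already running this job" differently from a real failure. A rough Python sketch of that pattern follows; the URL prefix, the OAuth2 header, the helper name, and the retry counts are assumptions for illustration. Only the /jobs/<uuid>/lock call and exit code 75 come from the log.)

import sys
import time

import requests  # any HTTP client would do; used here only for illustration

EX_TEMPFAIL = 75  # sysexits.h EX_TEMPFAIL, the "exiting 75" seen in the log


def lock_job_or_bail(api_base, token, job_uuid, attempts=3, delay=4):
    # Try the jobs/lock call seen in the log above.  The URL prefix and the
    # OAuth2 header are assumptions about the Arvados REST API layout.
    url = "%s/arvados/v1/jobs/%s/lock" % (api_base, job_uuid)
    for _ in range(attempts):
        resp = requests.post(url, headers={"Authorization": "OAuth2 " + token})
        if resp.ok:
            return  # we own the job now
        # 403 AlreadyLockedError: another crunch-job won the race; wait and retry.
        time.sleep(delay)
    # Give up without treating this as a job failure: exit 75 tells the
    # dispatcher "temporary condition, someone else is running the job".
    sys.exit(EX_TEMPFAIL)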

#2

Updated by Abram Connelly about 9 years ago

It's hard to be sure, but I think this issue is triggered when a job/pipeline fails for some reason: the next job run after it is automatically listed as 'failed' even though it appears to be running. I've been running pipeline instances that invoke the same jobs and seeing this issue crop up, so perhaps the fact that they're nearly identical pipelines, separated only by a Git update or an input parameter change, has something to do with it.

For example, on tb05z, I ran pipeline instance tb05z-d1hrv-s0ri2l1ow825ioc, which failed because of bugs in my program. I then re-ran it, I believe changing only the inputs and not any of the repositories used by the jobs in the pipeline, and the resulting pipeline instance tb05z-d1hrv-5rvli36mtc5twzf shows 'failed' even though the real-time logs in Workbench show it as running. As of this writing, the 'failed' pipeline instance 5rvli... appears to be running correctly.

#3

Updated by Abram Connelly about 9 years ago

Making a dummy commit ('dirtying' the git repo) and re-running the pipeline appears to work.

#4

Updated by Peter Amstutz about 9 years ago

su92l is running multiple crunch-dispatch --job instances. When there is a race condition and multiple crunch-jobs get started, they are each supposed to try to lock the job, and the losers that can't run it exit with return code 75. What appears to be happening is that each dispatcher asks SLURM which nodes are free, and each tries to allocate the same node. When they run salloc, one of the dispatchers wins, but the other two fail node allocation with "salloc: error: Unable to allocate resources: Requested nodes are busy". The bug is that crunch-dispatch does not detect the salloc failure and instead applies the logic for "job exited without setting job.state to Completed or Failed", which causes the job to be marked "Failed".
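
To make that distinction concrete, here is a minimal sketch (in Python; the real dispatcher is Ruby, so none of this is the actual crunch-dispatch code). The salloc flags, the error text, and exit code 75 come from the logs in this ticket; the function name, return values, and the "leave queued" idea are assumptions about what the fix should do:

import subprocess

EX_TEMPFAIL = 75  # crunch-job exits 75 when it loses the jobs/lock race


def dispatch_once(job_uuid, node, crunch_job_cmd):
    # Ask SLURM for the chosen node and run crunch-job inside the allocation.
    proc = subprocess.run(
        ["salloc", "--immediate", "--exclusive",
         "--job-name=" + job_uuid, "--nodelist=" + node] + crunch_job_cmd,
        capture_output=True, text=True)

    if proc.returncode == EX_TEMPFAIL:
        # crunch-job ran but another dispatcher already held the lock;
        # the winner is handling the job, so do nothing.
        return "locked elsewhere"
    if "Unable to allocate resources" in proc.stderr:
        # salloc itself failed: a competing dispatcher grabbed the node first.
        # crunch-job never ran, so the job should stay queued and be retried;
        # this is the case the current code mistakes for a failed job.
        return "leave queued, retry later"
    if proc.returncode != 0:
        # crunch-job really did exit without setting Completed or Failed.
        return "mark Failed"
    return "finished; crunch-job already set the final job state"

The key point is only the middle branch: an allocation failure means the job never started, so it must not fall through to the "exited without setting job.state" handling.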

#5

Updated by Peter Amstutz about 9 years ago

  • Subject changed from Transient pipeline failure to [Crunch] Dispatcher not handling node allocation failures
#6

Updated by Tom Clegg about 9 years ago

  • Target version changed from Bug Triage to 2015-04-29 sprint
#7

Updated by Tom Clegg about 9 years ago

  • Story points set to 0.5
#8

Updated by Tom Clegg about 9 years ago

  • Target version changed from 2015-04-29 sprint to Arvados Future Sprints
#9

Updated by Tom Clegg about 9 years ago

  • Target version changed from Arvados Future Sprints to 2015-04-29 sprint
#10

Updated by Brett Smith about 9 years ago

  • Assigned To set to Brett Smith
#11

Updated by Brett Smith about 9 years ago

  • Status changed from New to In Progress
#12

Updated by Brett Smith about 9 years ago

5352-crunch-dispatch-salloc-tempfail-wip is up for review. Please see the commit message for explanation.

#13

Updated by Brett Smith about 9 years ago

By running this patch in Docker, I successfully reproduced the race condition and saw the changes complete the job. Here are the logs from two simultaneous crunch-dispatch processes:

dispatch: ["git", "--git-dir=/var/lib/arvados/internal.git", "tag", "bcsdk-8i9sb-p7lp5shen37qrib", "eb5da02dd7bb5552bd0594e46d081567d485ad5e"]

dispatch: sudo -E -u crunch PATH=/usr/local/rvm/gems/ruby-2.1.5/bin:/usr/local/rvm/gems/ruby-2.1.5@global/bin:/usr/local/rvm/rubies/ruby-2.1.5/bin:/usr/local/rvm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/src/arvados/services/crunch PERLLIB=/usr/src/arvados/sdk/perl/lib PYTHONPATH= RUBYLIB=/usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.9.2/lib GEM_PATH=/usr/local/rvm/gems/ruby-2.1.5:/usr/local/rvm/gems/ruby-2.1.5@global salloc --chdir=/ --immediate --exclusive --no-kill --job-name=bcsdk-8i9sb-p7lp5shen37qrib --nodelist=compute0 /usr/src/arvados/services/crunch/crunch-job --job-api-token 26mw2gzbcaio1boim5pzb7qn6s3um2r6zuyvyo6jviqbrkhh2v --job bcsdk-8i9sb-p7lp5shen37qrib --git-dir /var/lib/arvados/internal.git
dispatch: job bcsdk-8i9sb-p7lp5shen37qrib
dispatch: child 30044 start 2015-04-06_14:04:45
dispatch: update compute0 state to {:state=>"alloc", :job=>"bcsdk-8i9sb-p7lp5shen37qrib"}
dispatch: ["git", "--git-dir=/var/lib/arvados/internal.git", "tag", "bcsdk-8i9sb-p7lp5shen37qrib", "eb5da02dd7bb5552bd0594e46d081567d485ad5e"]
bcsdk-8i9sb-p7lp5shen37qrib ! salloc: Granted job allocation 29

dispatch: sudo -E -u crunch PATH=/usr/local/rvm/gems/ruby-2.1.5/bin:/usr/local/rvm/gems/ruby-2.1.5@global/bin:/usr/local/rvm/rubies/ruby-2.1.5/bin:/usr/local/rvm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/src/arvados/services/crunch PERLLIB=/usr/src/arvados/sdk/perl/lib PYTHONPATH= RUBYLIB=/usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.9.2/lib GEM_PATH=/usr/local/rvm/gems/ruby-2.1.5:/usr/local/rvm/gems/ruby-2.1.5@global salloc --chdir=/ --immediate --exclusive --no-kill --job-name=bcsdk-8i9sb-p7lp5shen37qrib --nodelist=compute0 /usr/src/arvados/services/crunch/crunch-job --job-api-token p7945utn6byf0ioqefrivx55r8ailzc68hsifskhisaq43ssp --job bcsdk-8i9sb-p7lp5shen37qrib --git-dir /var/lib/arvados/internal.git
dispatch: job bcsdk-8i9sb-p7lp5shen37qrib
dispatch: child 30062 start 2015-04-06_14:04:45
dispatch: update compute0 state to {:state=>"alloc", :job=>"bcsdk-8i9sb-p7lp5shen37qrib"}
dispatch: child 30062 exit
dispatch: job bcsdk-8i9sb-p7lp5shen37qrib end
bcsdk-8i9sb-p7lp5shen37qrib ! salloc: error: Unable to allocate resources: Requested nodes are busy
bcsdk-8i9sb-p7lp5shen37qrib 30053  running from /usr/src/arvados/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20150323181012, 0.1.20150205181653
bcsdk-8i9sb-p7lp5shen37qrib 30053  check slurm allocation
bcsdk-8i9sb-p7lp5shen37qrib 30053  node compute0 - 1 slots
bcsdk-8i9sb-p7lp5shen37qrib 30053  start
bcsdk-8i9sb-p7lp5shen37qrib 30053  Clean work dirs
bcsdk-8i9sb-p7lp5shen37qrib ! starting: ['srun','--nodelist=compute0','-D','/tmp','bash','-ec','mount -t fuse,fuse.keep | awk \'($3 ~ /\\ykeep\\y/){print $3}\' | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
bcsdk-8i9sb-p7lp5shen37qrib 30053  Cleanup command exited 0
bcsdk-8i9sb-p7lp5shen37qrib ! starting: ['srun','--nodelist=compute0','/bin/sh','-ec',' if ! /usr/bin/docker.io images -q --no-trunc --all | grep -qxF ae2af9352ae54e27690ab8645e85e501f7ba78a66b20bd7abc2ccbcced74db06; then     arv-get 5814aaff04a7177493c207d16689211f\\+966\\/ae2af9352ae54e27690ab8645e85e501f7ba78a66b20bd7abc2ccbcced74db06\\.tar | /usr/bin/docker.io load fi ']
bcsdk-8i9sb-p7lp5shen37qrib 30053  Looking for version eb5da02dd7bb5552bd0594e46d081567d485ad5e from repository brett/brett
bcsdk-8i9sb-p7lp5shen37qrib 30053  Using local repository '/var/lib/arvados/internal.git'
bcsdk-8i9sb-p7lp5shen37qrib 30053  Version eb5da02dd7bb5552bd0594e46d081567d485ad5e is commit eb5da02dd7bb5552bd0594e46d081567d485ad5e
bcsdk-8i9sb-p7lp5shen37qrib 30053  Run install script on all workers
starting: ['srun','--nodelist=compute0','-D','/tmp','--job-name=bcsdk-8i9sb-p7lp5shen37qrib','sh','-c','mkdir -p /tmp/crunch-job/opt && cd /tmp/crunch-job && perl -']
bcsdk-8i9sb-p7lp5shen37qrib 30053  Install script exited 0
bcsdk-8i9sb-p7lp5shen37qrib 30053  script 4027test.py
bcsdk-8i9sb-p7lp5shen37qrib 30053  script_version eb5da02dd7bb5552bd0594e46d081567d485ad5e
bcsdk-8i9sb-p7lp5shen37qrib 30053  script_parameters {}
bcsdk-8i9sb-p7lp5shen37qrib 30053  runtime_constraints {"max_tasks_per_node":0,"docker_image":"arvados/jobs"}
bcsdk-8i9sb-p7lp5shen37qrib 30053  start level 0
bcsdk-8i9sb-p7lp5shen37qrib 30053  status: 0 done, 0 running, 1 todo
bcsdk-8i9sb-p7lp5shen37qrib 30053 0 job_task bcsdk-ot0gb-oqafub6fvjy29pz
bcsdk-8i9sb-p7lp5shen37qrib 30053 0 child 30502 started on compute0.1
[elide normal job log]
bcsdk-8i9sb-p7lp5shen37qrib 30053 1 child 30556 on compute0.1 exit 0 success=true
bcsdk-8i9sb-p7lp5shen37qrib 30053 1 success in 2 seconds
bcsdk-8i9sb-p7lp5shen37qrib 30053 1 task output (87 bytes): [locator]
bcsdk-8i9sb-p7lp5shen37qrib 30053  wait for last 0 children to finish
bcsdk-8i9sb-p7lp5shen37qrib 30053  status: 2 done, 0 running, 0 todo
bcsdk-8i9sb-p7lp5shen37qrib 30053  release job allocation
bcsdk-8i9sb-p7lp5shen37qrib ! salloc: Job allocation 29 has been revoked.
bcsdk-8i9sb-p7lp5shen37qrib 30053  Freeze not implemented
bcsdk-8i9sb-p7lp5shen37qrib 30053  collate
dispatch: update compute0 state to {:state=>"idle", :job=>nil}
bcsdk-8i9sb-p7lp5shen37qrib 30053  collated output manifest text to send to API server is 224 bytes with access tokens
bcsdk-8i9sb-p7lp5shen37qrib 30053  job output [locator]
dispatch: update compute0 state to {:state=>"idle", :job=>nil}
bcsdk-8i9sb-p7lp5shen37qrib 30053  finish
Collection saved as 'bcsdk-8i9sb-p7lp5shen37qrib.log.txt'
bcsdk-8i9sb-p7lp5shen37qrib 30053  log manifest is [locator]
dispatch: child 30044 exit
dispatch: job bcsdk-8i9sb-p7lp5shen37qrib end

After this, the job is marked successful in the API server.

#14

Updated by Peter Amstutz about 9 years ago

It's convenient that SLURM has a feature to help with exactly this situation.

a9c5882 LGTM

#15

Updated by Brett Smith about 9 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:92052dc0b2f80a16420483a756fcebedc6c6ec3c.
