Bug #9924


[Crunch] Recover from lost slurm allocation

Added by Peter Amstutz over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: -
Target version: 2016-09-14 sprint
Story points: -

Description

If slurm thinks that a node has failed, it may revoke crunch-job's allocation. When this happens, the check_squeue() feature of crunch-job may detect it as a "tempfail", but crunch-job has no way to recover:

2016-08-31_23:02:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 notice: task is not in slurm queue but srun process 28336 has not exited
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 killing orphaned srun process 28336 (task not in slurm queue, no stderr received in last 51s)
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 sending 2x signal 2 to pid 28336
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: interrupt (one more within 1 sec to abort)
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: task 0: running
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: sending Ctrl-C to job 4489.7
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 killing orphaned srun process 28336 (task not in slurm queue, no stderr received in last 30s)
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 sending 2x signal 15 to pid 28336
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: error: compute5: signal: Communication connection failure
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  backing off node compute5 for 60 seconds
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: forcing job termination
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-08-31_23:03:45 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: error: Timed out waiting for job step to complete
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 child 28336 on compute5.1 exit 0 success=
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 ERROR: Task process exited 0, but never updated its task record to indicate success and record its output.
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 failure (#1, temporary) after 784 seconds
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 task output (0 bytes): 
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  status: 0 done, 0 running, 1 todo
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 job_task bd44f-ot0gb-689llo62g84sohy
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 child 5689 started on compute5.1
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr starting: ['srun', ...]
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: error: Unable to confirm allocation for job 4489: Invalid job id specified
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 child 5689 on compute5.1 exit 1 success=false
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 failure (#2, permanent) after 0 seconds
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 task output (0 bytes): 
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  status: 0 done, 0 running, 1 todo
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  wait for last 0 children to finish
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  release job allocation
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  Freeze not implemented
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  collate
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  collated output manifest text to send to API server is 0 bytes with access tokens

Detect the error "Invalid job id specified" and exit EX_RETRY_UNLOCKED so that crunch-dispatch will restart the job with a new allocation.
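
A minimal sketch of the intended behavior, in Python for illustration (crunch-job itself is Perl; the exit-code value and the run_step helper are assumptions, not the real implementation): scan srun's stderr for the revocation message and exit with the retry code instead of treating it as an ordinary task failure.

import re
import subprocess
import sys

# Assumed numeric value, for illustration only; crunch-job defines its
# own EX_RETRY_UNLOCKED constant, which crunch-dispatch interprets as
# "restart this job with a fresh slurm allocation".
EX_RETRY_UNLOCKED = 93

REVOKED_RE = re.compile(r"Invalid job id specified")

def run_step(srun_argv):
    # Run one task step via srun, capturing stderr so we can inspect it.
    proc = subprocess.run(srun_argv, capture_output=True, text=True)
    sys.stderr.write(proc.stderr)
    if REVOKED_RE.search(proc.stderr):
        # Our allocation is gone: every further srun under this
        # allocation would fail permanently, so hand control back to
        # the dispatcher rather than counting a task failure.
        sys.exit(EX_RETRY_UNLOCKED)
    return proc.returncode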


Subtasks 1 (0 open, 1 closed)

Task #10015: Review 9924-revoked-working-slots (Resolved, Peter Amstutz, 09/13/2016)
#1

Updated by Peter Amstutz over 7 years ago

  • Description updated (diff)
#2

Updated by Peter Amstutz over 7 years ago

  • Description updated (diff)
#3

Updated by Peter Amstutz over 7 years ago

  • Assigned To set to Peter Amstutz
  • Target version set to 2016-09-14 sprint
#4

Updated by Peter Amstutz over 7 years ago

  • Status changed from New to In Progress
#5

Updated by Tom Clegg over 7 years ago

9924-revoked-working-slots @ faf5328

Setting $working_slot_count = 0 here seems superfluous/misleading. It's technically still in scope here, but it really belongs to the task-monitoring loop, and it looks like that loop will overwrite this new value before anyone reads it anyway. So we shouldn't pretend to communicate through it here.
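
A toy sketch of that pattern, in Python rather than the branch's Perl (everything except working_slot_count is hypothetical): an assignment made outside the loop that owns and recomputes the value is dead, because the loop overwrites it before anything reads it.

from dataclasses import dataclass

@dataclass
class Node:
    working: bool

@dataclass
class Slot:
    node: Node

def monitor_tasks(slots, passes=3):
    working_slot_count = 0  # dead write: overwritten below before any read
    for _ in range(passes):
        # The task-monitoring loop owns working_slot_count and
        # recomputes it on every pass, so the assignment above never
        # reaches a reader.
        working_slot_count = sum(1 for s in slots if s.node.working)
        if working_slot_count == 0:
            break  # no usable nodes left

monitor_tasks([Slot(Node(True)), Slot(Node(False))])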

"mark all slots as failed" in the comment should be "mark all nodes as failed".

The word "whoa" from the previous comment is even less helpful now that you've added a proper explanation, so we should probably drop it.

Other than those, LGTM. Thanks.

#6

Updated by Peter Amstutz over 7 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:50b696969d71739e9fd083664de6a81db7e211b3.
