Bug #9924

closed

[Crunch] Recover from lost slurm allocation

Added by Peter Amstutz over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 09/13/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description

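If Slurm decides that a node has failed, it may revoke crunch-job's allocation. When this happens, crunch-job's check_squeue() check may classify the first task failure as temporary ("tempfail"), but crunch-job cannot recover: the retry runs srun against the revoked allocation, which fails immediately and is recorded as a permanent failure: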

2016-08-31_23:02:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 notice: task is not in slurm queue but srun process 28336 has not exited
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 killing orphaned srun process 28336 (task not in slurm queue, no stderr received in last 51s)
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 sending 2x signal 2 to pid 28336
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: interrupt (one more within 1 sec to abort)
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: task 0: running
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: sending Ctrl-C to job 4489.7
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 killing orphaned srun process 28336 (task not in slurm queue, no stderr received in last 30s)
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 sending 2x signal 15 to pid 28336
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: error: compute5: signal: Communication connection failure
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  backing off node compute5 for 60 seconds
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: forcing job termination
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-08-31_23:03:45 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: error: Timed out waiting for job step to complete
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 child 28336 on compute5.1 exit 0 success=
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 ERROR: Task process exited 0, but never updated its task record to indicate success and record its output.
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 failure (#1, temporary) after 784 seconds
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 task output (0 bytes): 
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  status: 0 done, 0 running, 1 todo
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 job_task bd44f-ot0gb-689llo62g84sohy
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 child 5689 started on compute5.1
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr starting: ['srun', ...]
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: error: Unable to confirm allocation for job 4489: Invalid job id specified
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 child 5689 on compute5.1 exit 1 success=false
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 failure (#2, permanent) after 0 seconds
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 task output (0 bytes): 
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  status: 0 done, 0 running, 1 todo
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  wait for last 0 children to finish
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  release job allocation
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  Freeze not implemented
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  collate
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925  collated output manifest text to send to API server is 0 bytes with access tokens

Proposed fix: detect the srun error "Invalid job id specified" and exit with EX_RETRY_UNLOCKED so that crunch-dispatch restarts the job with a new allocation.
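
The detection step could look roughly like the sketch below. It is written in Python for illustration only and is not crunch-job's actual code. The names `allocation_lost` and `exit_code_for` are hypothetical, and the numeric value given to `EX_RETRY_UNLOCKED` is an assumption that should be checked against crunch-job itself. The pattern is based on the srun message shown in the log above.

```python
import re

# Assumption: placeholder value for the "retry unlocked" exit status.
# Check crunch-job for the real constant before relying on this number.
EX_RETRY_UNLOCKED = 93

# Message srun prints when the Slurm allocation no longer exists
# (wording taken from the log excerpt in this report).
LOST_ALLOCATION_RE = re.compile(r"srun: error: .*Invalid job id specified")


def allocation_lost(stderr_lines):
    """Return True if any stderr line shows the allocation was revoked."""
    return any(LOST_ALLOCATION_RE.search(line) for line in stderr_lines)


def exit_code_for(stderr_lines, default=1):
    """Pick the process exit status: ask the dispatcher for a retry with a
    fresh allocation if the old one is gone, otherwise use the default."""
    if allocation_lost(stderr_lines):
        return EX_RETRY_UNLOCKED
    return default
```

A caller would collect the child's stderr lines and pass the result of `exit_code_for(lines)` to `sys.exit()`. Matching on the message text is fragile if srun's wording changes between Slurm versions, so the pattern is kept in one place.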


Subtasks 1 (0 open, 1 closed)

Task #10015: Review 9924-revoked-working-slots (Resolved, Peter Amstutz, 09/13/2016)
