Bug #9924
Updated by Peter Amstutz over 8 years ago
If slurm thinks that a node has failed, it may revoke crunch-job's allocation. When this happens, the check_squeue() feature of crunch-job may detect it as a "tempfail" but it is impossible for crunch-job to recover:

<pre>
2016-08-31_23:02:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 notice: task is not in slurm queue but srun process 28336 has not exited
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 killing orphaned srun process 28336 (task not in slurm queue, no stderr received in last 51s)
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 sending 2x signal 2 to pid 28336
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: interrupt (one more within 1 sec to abort)
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: task 0: running
2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: sending Ctrl-C to job 4489.7
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 killing orphaned srun process 28336 (task not in slurm queue, no stderr received in last 30s)
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 sending 2x signal 15 to pid 28336
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: error: compute5: signal: Communication connection failure
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 recover. backing off node compute5 for 60 seconds
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: forcing job termination
2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-08-31_23:03:45 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: error: Timed out waiting for job step to complete
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 child 28336 on compute5.1 exit 0 success=
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 ERROR: Task process exited 0, but never updated its task record to indicate success and record its output.
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 failure (#1, temporary) after 784 seconds
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 task output (0 bytes):
2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 status: 0 done, 0 running, 1 todo
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 job_task bd44f-ot0gb-689llo62g84sohy
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 child 5689 started on compute5.1
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr starting: ['srun', ...]
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: error: Unable to confirm allocation for job 4489: Invalid job id specified
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 child 5689 on compute5.1 exit 1 success=false
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 failure (#2, permanent) after 0 seconds
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 task output (0 bytes):
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 status: 0 done, 0 running, 1 todo
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 wait for last 0 children to finish
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 release job allocation
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 Freeze not implemented
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 collate
2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 collated output manifest text to send to API server is 0 bytes with access tokens
</pre>

Detect the error "Invalid job id specified" and exit EX_RETRY_UNLOCKED so that crunch-dispatch will restart the job with a new allocation.
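
A minimal sketch of the proposed detection (Python, for illustration only; this is a sketch of the logic rather than a patch to crunch-job, and the exit-code names and values below are assumptions, not the real constants):

<pre>
import re
import sys
from typing import Optional

# Assumed exit-code values for illustration; the real constants are defined in crunch-job.
EX_TEMPFAIL = 75         # transient failure: retry within the current allocation
EX_RETRY_UNLOCKED = 93   # allocation lost: exit so crunch-dispatch requeues the job

# Errors srun prints once slurm has revoked the job allocation (see the log above).
FATAL_SRUN_ERRORS = (
    r"Invalid job id specified",
    r"Unable to confirm allocation for job \d+",
)

def srun_exit_code(stderr_text: str) -> Optional[int]:
    """Return EX_RETRY_UNLOCKED if stderr shows the allocation is gone,
    otherwise None (caller keeps treating the failure as an ordinary tempfail)."""
    for pattern in FATAL_SRUN_ERRORS:
        if re.search(pattern, stderr_text):
            return EX_RETRY_UNLOCKED
    return None

if __name__ == "__main__":
    # The stderr from the second task attempt above should trigger a requeue.
    sample = ("srun: error: Unable to confirm allocation for job 4489: "
              "Invalid job id specified\n")
    code = srun_exit_code(sample)
    if code is not None:
        sys.exit(code)
</pre>

With something like this in place, the second attempt in the log above would end the crunch-job process with EX_RETRY_UNLOCKED instead of recording a permanent failure, and crunch-dispatch could pick the job up again under a fresh slurm allocation.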