Bug #9924: [Crunch] Recover from lost slurm allocation - Arvados

Updated by Peter Amstutz over 8 years ago

If slurm thinks that a node has failed, it may revoke crunch-job's allocation. When this happens, the check_squeue() feature of crunch-job may detect it as a "tempfail", but crunch-job cannot recover, because every subsequent srun attempt fails against the revoked allocation:

 <pre> 
 2016-08-31_23:02:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 notice: task is not in slurm queue but srun process 28336 has not exited 
 2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 killing orphaned srun process 28336 (task not in slurm queue, no stderr received in last 51s) 
 2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 sending 2x signal 2 to pid 28336 
 2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: interrupt (one more within 1 sec to abort) 
 2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: task 0: running 
 2016-08-31_23:03:14 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: sending Ctrl-C to job 4489.7 
 2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 killing orphaned srun process 28336 (task not in slurm queue, no stderr received in last 30s) 
 2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 sending 2x signal 15 to pid 28336 
 2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: error: compute5: signal: Communication connection failure 
 2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925    backing off node compute5 for 60 seconds 
 2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: forcing job termination 
 2016-08-31_23:03:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish. 
 2016-08-31_23:03:45 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: error: Timed out waiting for job step to complete 
 2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 child 28336 on compute5.1 exit 0 success= 
 2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 ERROR: Task process exited 0, but never updated its task record to indicate success and record its output. 
 2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 failure (#1, temporary) after 784 seconds 
 2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 task output (0 bytes):  
 2016-08-31_23:03:46 bd44f-8i9sb-c1yqi8yozyqx5hy 24925    status: 0 done, 0 running, 1 todo 
 2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 job_task bd44f-ot0gb-689llo62g84sohy 
 2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 child 5689 started on compute5.1 
 2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr starting: ['srun', ...] 
 2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: error: Unable to confirm allocation for job 4489: Invalid job id specified 
 2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job. 
 2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 child 5689 on compute5.1 exit 1 success=false 
 2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 failure (#2, permanent) after 0 seconds 
 2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925 0 task output (0 bytes):  
 2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925    status: 0 done, 0 running, 1 todo 
 2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925    wait for last 0 children to finish 
 2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925    release job allocation 
 2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925    Freeze not implemented 
 2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925    collate 
 2016-08-31_23:04:44 bd44f-8i9sb-c1yqi8yozyqx5hy 24925    collated output manifest text to send to API server is 0 bytes with access tokens 
 </pre> 

Detect the error "Invalid job id specified" and exit EX_RETRY_UNLOCKED so that crunch-dispatch restarts the job with a new allocation.
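The proposed detection could look roughly like the following. This is only an illustrative sketch in Python, not crunch-job's actual code: the helper names `allocation_lost` and `exit_code_for` are hypothetical, the pattern is taken from the srun error line in the log above, and the numeric value given to `EX_RETRY_UNLOCKED` is an assumption that would need to match whatever crunch-job and crunch-dispatch agree on.

```python
import re

# Assumption: the name comes from this issue; the numeric value is a
# placeholder and must match the constant crunch-job actually defines.
EX_RETRY_UNLOCKED = 93

# Pattern based on the srun error line shown in the log excerpt.
LOST_ALLOCATION_RE = re.compile(
    r"srun: error: Unable to confirm allocation for job \d+: "
    r"Invalid job id specified"
)


def allocation_lost(stderr_lines):
    """Return True if any stderr line reports that the slurm allocation is gone."""
    return any(LOST_ALLOCATION_RE.search(line) for line in stderr_lines)


def exit_code_for(stderr_lines, default_code):
    """Choose the exit code: EX_RETRY_UNLOCKED if the allocation was lost,
    otherwise the code the caller would have used anyway."""
    if allocation_lost(stderr_lines):
        return EX_RETRY_UNLOCKED
    return default_code
```

With a check like this, the second failure in the log (the one currently marked "permanent") would instead end the process with the retry code, leaving it to the dispatcher to obtain a fresh allocation.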
