Bug #8869

closed

[Crunch] Job was repeatedly retried on same bad compute node until abandoned

Added by Brett Smith over 8 years ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

gatk queue parent job: https://workbench.wx7k5.arvadosapi.com/collections/c224325251c4194e854235c7877ce6f5+89/wx7k5-8i9sb-w0sevdd7ysszqjn.log.txt
child job: wx7k5-8i9sb-f0ygdqygwonamfr

This is the last log, from the logs table:

2016-03-26_20:51:23 salloc: Granted job allocation 228
2016-03-26_20:51:23 13514  Sanity check is `docker.io ps -q`
2016-03-26_20:51:23 13514  sanity check: start
2016-03-26_20:51:23 13514  stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','docker.io','ps','-q']
2016-03-26_20:51:23 13514  stderr srun: error: Task launch for 228.0 failed on node compute15: No such file or directory
2016-03-26_20:51:23 13514  stderr srun: error: Application launch failed: No such file or directory
2016-03-26_20:51:23 13514  stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-26_20:51:23 13514  stderr srun: error: Timed out waiting for job step to complete
2016-03-26_20:51:23 13514  sanity check: exit 2
2016-03-26_20:51:23 13514  Sanity check failed: 2
2016-03-26_20:51:23 salloc: Relinquishing job allocation 228

The job was marked as failed immediately after this. That's a little surprising: why wasn't it retried as intended?
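
To help check whether retries keep landing on the same node, the srun error line in the log above can be parsed for the node name. This is only a minimal sketch: the regular expression is modeled on the single error line shown in this report, and the function name `failed_nodes` is hypothetical, not part of crunch-job.

```python
import re

# Modeled on the srun error line quoted in this report; other srun
# versions may word the message differently (assumption).
NODE_FAILURE = re.compile(r"srun: error: Task launch for (\S+) failed on node (\S+?):")


def failed_nodes(log_lines):
    """Return the set of node names for which srun reported a launch failure."""
    nodes = set()
    for line in log_lines:
        match = NODE_FAILURE.search(line)
        if match:
            nodes.add(match.group(2))
    return nodes


log = [
    "2016-03-26_20:51:23 13514  stderr srun: error: Task launch for 228.0 "
    "failed on node compute15: No such file or directory",
    "2016-03-26_20:51:23 13514  sanity check: exit 2",
]
print(failed_nodes(log))
```

Running this over the logs of each attempt would show whether every failure names the same node. The resulting names could, for instance, be passed to salloc's `--exclude` option on a retry; whether crunch-job should do that is a separate question.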


Related issues

Related to Arvados - Bug #8810: [Crunch] `docker load` fails to connect to endpoint; srun exits 0 (Resolved, Brett Smith, 04/05/2016)
Copied from Arvados - Bug #8807: [Crunch] crunch-job doesn't save logs when exiting EX_TEMPFAIL (Closed, Brett Smith, 03/31/2016)