Bug #8191

closed

[Crunch] Tool fails and srun reports error: Abandoning IO 60 secs after job shutdown initiated

Added by Bryan Cosca over 8 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Story points:
-

Description

It looks like a job prematurely finished and no output was recorded because of this error.

2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 stderr 00:07:30.881\011\011\011X:85419918
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 stderr ERROR: ALT field does not match
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 stderr \011VCF entry dbsnp_138\011X\01185470602\011rs201008478\011TC\011T\011.\011.\011CAF=[0.9553,0.04474];COMMON=1;INT;KGPROD;KGPhase1;KGPilot123;RS=201008478;RSPOS=85470603;SAO=0;SSR=0;VC=DIV;VP=0x05000008000110001c000200;WGT=1;dbSNPBuildID=137
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 stderr srun: error: Abandoning IO 60 secs after job shutdown initiated
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 child 50302 on compute2.1 exit 1 success=
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 ERROR: Task process exited 1, but never updated its task record to indicate success and record its output.
2016-01-11_22:17:39 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 failure (#1, permanent) after 1009 seconds

https://workbench.wx7k5.arvadosapi.com/collections/d0cd77591c4db547ce1f9daf62341eaf+91/wx7k5-8i9sb-3m1ey3oj3sibv7l.log.txt


Related issues

Related to Arvados - Bug #8284: [Crunch] Task killed even though it is still running (Resolved, Tom Clegg, 01/23/2016)
Actions #1

Updated by Bryan Cosca over 8 years ago

I reran the job with SDK version 722e147756526579ba32a31f967e9e00d47fd3ed

(previously I used 92768ce858673678aa7924f83ad41e2a9f8dd678)

https://workbench.wx7k5.arvadosapi.com/pipeline_instances/wx7k5-d1hrv-ejxs4nxc0dy21xr

and it worked; the job is no longer blocked.

Actions #2

Updated by Brett Smith over 8 years ago

  • Target version set to Arvados Future Sprints
Actions #3

Updated by Bryan Cosca over 8 years ago

This job used 722e147756526579ba32a31f967e9e00d47fd3ed

https://workbench.wx7k5.arvadosapi.com/collections/e6ee0c7858d44fec27cd245ed98b0117+91/wx7k5-8i9sb-o0dzi5z6vlq204g.log.txt

and failed with the same error.

Actions #4

Updated by Bryan Cosca over 8 years ago

Also, I think it's currently blocking #7933. It's a little hard to tell, but I think this is making the job end prematurely.

Actions #5

Updated by Tom Clegg over 8 years ago

Ideas about situations that might cause this:
  • It takes >60 seconds to process the buffered stderr after the slurm jobstep exits. The process already exited 1 for some unrelated reason, but we don't see the relevant part of stderr because slurm cut us off before we got to it.
  • The task process forks/detaches a child process that is still running after the task process exits, and stderr is still coming from that daemon process for >= 60s when slurm decides something is wrong. (But: wouldn't "docker run" shut down the container and kill off any such daemon process before exiting?) A sketch of this scenario follows below.
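
To illustrate the second scenario, here is a minimal, purely hypothetical Python sketch (not taken from this job) of a task that forks a child which inherits stderr and keeps writing to it after the task itself has exited -- the kind of situation that would keep srun's IO stream busy:

    import os
    import sys
    import time

    # Hypothetical illustration: the task process forks a child that inherits
    # stderr and keeps writing to it after the parent exits. From slurm's
    # point of view, the jobstep's IO stream stays open and busy even though
    # the task itself is gone.

    pid = os.fork()
    if pid == 0:
        # Child: detach into its own session but keep the inherited stderr
        # file descriptor, then keep writing to it for a couple of minutes.
        os.setsid()
        for i in range(120):
            print(f"background child still writing, tick {i}", file=sys.stderr)
            time.sleep(1)
        sys.exit(0)

    # Parent (the "task process"): exits immediately with an error, like the
    # snpsift task in the log above. The child is what keeps stderr busy.
    print("task process exiting now", file=sys.stderr)
    sys.exit(1)

In that situation slurm's 60-second IO timeout is arguably behaving as designed: it stops waiting on a stream that the detached child would otherwise keep open indefinitely.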
Actions #6

Updated by Bryan Cosca over 8 years ago

https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-6k7kp8hgr81hicn#Status

This job prints stderr to a file, and the error does not show up.
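
For reference, a rough sketch of that kind of workaround, assuming a wrapper script launches the tool with Python's subprocess module (the snpsift command line and file names here are placeholders, not the ones from this job):

    import subprocess
    import sys

    # Hypothetical wrapper: send the tool's very chatty stderr to a file
    # instead of the srun-managed stderr stream, so slurm has little or
    # nothing left to flush when the task exits.
    cmd = ["java", "-jar", "SnpSift.jar", "annotate",
           "dbsnp_138.vcf", "input.vcf"]  # placeholder arguments

    with open("output.vcf", "w") as out, open("snpsift.stderr.log", "w") as errlog:
        returncode = subprocess.call(cmd, stdout=out, stderr=errlog)

    sys.exit(returncode)

The stderr log file can then be saved alongside the task's output if it is ever needed for debugging.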

Actions #7

Updated by Brett Smith about 8 years ago

  • Subject changed from "srun: error: Abandoning IO 60 secs after job shutdown initiated" to "[Crunch] Tool fails and srun reports error: Abandoning IO 60 secs after job shutdown initiated"

Bryan Cosca wrote:

https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-6k7kp8hgr81hicn#Status

This job prints stderr to a file, and the error does not show up.

I think I'm convinced this is basically a problem in snpsift, and SLURM is actually doing what we want it to do.

snpsift writes to stderr prolifically, so much so that SLURM can't send it all over the network within 60 seconds of snpsift finishing.

It's possible that these things are related: that snpsift is exiting with an error because it can't write to stderr immediately, because the buffer is full, or something like that. But I don't think we want to turn off SLURM's I/O timeout to fix this: that could threaten the stability of the cluster more generally.

Writing stderr to a file as you've done seems like a decent fix. Is all that stderr actually useful, though? If not, you might consider calling snpsift with switches to turn down the messages, or piping its stderr through grep to filter out some of the less useful messages.
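
As a concrete example of the grep suggestion, a wrapper could pipe the tool's stderr through grep so that only the interesting lines reach the srun-managed stream. A minimal sketch, with a placeholder command line and an example filter pattern:

    import subprocess
    import sys

    # Hypothetical sketch: pipe the tool's very chatty stderr through grep so
    # that only the interesting lines reach the srun-managed stderr stream.
    with open("output.vcf", "w") as out:
        tool = subprocess.Popen(
            ["java", "-jar", "SnpSift.jar", "annotate",
             "dbsnp_138.vcf", "input.vcf"],  # placeholder arguments
            stdout=out,
            stderr=subprocess.PIPE,
        )
        grep = subprocess.Popen(
            ["grep", "-E", "ERROR|WARNING"],  # example pattern; tune to the tool's messages
            stdin=tool.stderr,
            stdout=sys.stderr,
        )
        tool.stderr.close()  # so grep sees EOF once the tool exits
        grep.wait()
        rc = tool.wait()

    sys.exit(rc)

This keeps the useful ERROR lines in the job log while drastically reducing the volume srun has to relay within its 60-second window.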

Actions #8

Updated by Brett Smith about 8 years ago

  • Target version deleted (Arvados Future Sprints)
Actions #9

Updated by Peter Amstutz over 4 years ago

  • Status changed from New to Closed