Bug #4471

[Crunch] srun: error: Application launch failed: Communication connection failure

Added by Nancy Ouyang over 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Start date: 11/07/2014
Due date:
% Done: 0%
Estimated time:
Story points: 0.5

Description

$ arv run /bin/bash createtwofiles.sh

======
Upload local files: "createtwofiles.sh"
Uploaded to qr1hi-4zz18-ezuts5lpkpj6o6b
Running pipeline qr1hi-d1hrv-zv53r0mhykuj7cq
2014-11-07 22:37:27 arvados.events22240 WARNING: Got exception _ssl.c:331: No root certificates specified for verification of other-side certificates. trying to connect to websockets at wss://ws.qr1hi.arvadosapi.com/websocket
2014-11-07 22:37:27 arvados.events22240 WARNING: Websockets not available, falling back to log table polling
Fri Nov 7 22:37:39 2014 salloc: Granted job allocation 8146
Fri Nov 7 22:37:39 2014 qr1hi-8i9sb-ue3o9q5pi2r0eg7 29981 check slurm allocation
Fri Nov 7 22:37:39 2014 qr1hi-8i9sb-ue3o9q5pi2r0eg7 29981 node compute18 - 8 slots
Fri Nov 7 22:37:40 2014 qr1hi-8i9sb-ue3o9q5pi2r0eg7 29981 start
Fri Nov 7 22:37:40 2014 qr1hi-8i9sb-ue3o9q5pi2r0eg7 29981 Clean work dirs
Fri Nov 7 22:37:41 2014 qr1hi-8i9sb-ue3o9q5pi2r0eg7 29981 Cleanup command exited 0
Fri Nov 7 22:37:41 2014 qr1hi-8i9sb-ue3o9q5pi2r0eg7 29981 Looking for version 8a9fe2e16f1203f303afabc8c88b6e1ded9cec57 from repository arvados
Fri Nov 7 22:37:41 2014 qr1hi-8i9sb-ue3o9q5pi2r0eg7 29981 Using local repository '/var/lib/arvados/internal.git'
Fri Nov 7 22:37:41 2014 qr1hi-8i9sb-ue3o9q5pi2r0eg7 29981 Version 8a9fe2e16f1203f303afabc8c88b6e1ded9cec57 is commit 8a9fe2e16f1203f303afabc8c88b6e1ded9cec57
Fri Nov 7 22:37:41 2014 qr1hi-8i9sb-ue3o9q5pi2r0eg7 29981 Run install script on all workers
Fri Nov 7 22:37:41 2014 srun: error: Task launch for 8146.1 failed on node compute18: Communication connection failure
Fri Nov 7 22:37:41 2014 srun: error: Application launch failed: Communication connection failure
Fri Nov 7 22:37:41 2014 srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
Fri Nov 7 22:37:43 2014 srun: error: Timed out waiting for job step to complete
Fri Nov 7 22:37:43 2014 qr1hi-8i9sb-ue3o9q5pi2r0eg7 29981 Install script exited 1
Fri Nov 7 22:37:48 2014 srun: error: Task launch for 8146.2 failed on node compute18: Communication connection failure
Fri Nov 7 22:37:48 2014 srun: error: Application launch failed: Communication connection failure
Fri Nov 7 22:37:48 2014 srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
Fri Nov 7 22:37:50 2014 srun: error: Timed out waiting for job step to complete
Fri Nov 7 22:37:50 2014 qr1hi-8i9sb-ue3o9q5pi2r0eg7 29981 Installing Docker image from e22cdc86e1acc044f7cf446b37c7ead8+966 exited 1 at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 603
Fri Nov 7 22:37:50 2014 qr1hi-8i9sb-ue3o9q5pi2r0eg7 29981 Freeze not implemented
Fri Nov 7 22:37:50 2014 qr1hi-8i9sb-ue3o9q5pi2r0eg7 29981 collate
Fri Nov 7 22:37:51 2014 Collection saved as 'Saved at 2014-11-07 22:37:40 UTC by '
Fri Nov 7 22:37:51 2014 qr1hi-8i9sb-ue3o9q5pi2r0eg7 29981 log manifest is 13095e803daf57a9c389deca80a46ed0+83
Fri Nov 7 22:37:51 2014 Died at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 1464, <DATA> line 1.
Fri Nov 7 22:37:51 2014 salloc: Relinquishing job allocation 8146
Pipeline is Failed
No output

History

#1 Updated by Nancy Ouyang over 5 years ago

This also occurred here:
https://workbench.qr1hi.arvadosapi.com/pipeline_instances/qr1hi-d1hrv-7xe4y6klftdf2gw#Log

Fri Nov 7 22:58:30 2014 srun: error: Task launch for 8148.1 failed on node compute20: Communication connection failure
Fri Nov 7 22:58:30 2014 srun: error: Application launch failed: Communication connection failure
Fri Nov 7 22:58:31 2014 srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
Fri Nov 7 22:58:32 2014 srun: error: Timed out waiting for job step to complete
Fri Nov 7 22:58:32 2014 qr1hi-8i9sb-y7pdmoklllw6xog 29849 Install script exited 1
Fri Nov 7 22:58:37 2014 srun: error: Task launch for 8148.2 failed on node compute20: Communication connection failure
Fri Nov 7 22:58:37 2014 srun: error: Application launch failed: Communication connection failure
Fri Nov 7 22:58:37 2014 srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
Fri Nov 7 22:58:39 2014 srun: error: Timed out waiting for job step to complete
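
For triage, it may help to pull the failing job step and node out of logs like the two above. The following is a minimal sketch, assuming the log lines keep the exact wording quoted in this report; the names `LAUNCH_FAILURE`, `launch_failures` and `failures_per_node` are hypothetical helpers, not part of crunch-job or SLURM.

```python
import re
from collections import Counter

# Matches the srun task-launch failure lines as they appear in this report.
# The wording is taken from the logs above; other SLURM versions may differ.
LAUNCH_FAILURE = re.compile(
    r"srun: error: Task launch for (?P<step>\d+\.\d+) failed on node "
    r"(?P<node>\S+): Communication connection failure"
)


def launch_failures(lines):
    """Return a list of (job_step, node) pairs, one per failed task launch."""
    found = []
    for line in lines:
        match = LAUNCH_FAILURE.search(line)
        if match:
            found.append((match.group("step"), match.group("node")))
    return found


def failures_per_node(lines):
    """Count failed task launches per compute node."""
    return Counter(node for _step, node in launch_failures(lines))


# Lines copied from the log excerpt in comment #1.
sample = [
    "Fri Nov 7 22:58:30 2014 srun: error: Task launch for 8148.1 failed on node compute20: Communication connection failure",
    "Fri Nov 7 22:58:30 2014 srun: error: Application launch failed: Communication connection failure",
    "Fri Nov 7 22:58:37 2014 srun: error: Task launch for 8148.2 failed on node compute20: Communication connection failure",
]

print(launch_failures(sample))
print(failures_per_node(sample))
```

Grouping by node is only a guess at what is useful here: the two reports hit different nodes (compute18 and compute20), so a per-node count could show whether the failures cluster on particular machines.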

#2 Updated by Tim Pierce over 5 years ago

  • Subject changed from [arv-run] srun: error: Application launch failed: Communication connection failure to [Crunch] srun: error: Application launch failed: Communication connection failure
  • Category set to Crunch
  • Target version set to Bug Triage

#3 Updated by Brett Smith over 5 years ago

  • Assigned To deleted (Nancy Ouyang)

I believe we've addressed this with some configuration changes on our clusters to help keep SLURM robust. I've asked if we can use #4598 to confirm it.

#4 Updated by Tom Clegg over 5 years ago

  • Status changed from New to Feedback

#5 Updated by Tim Pierce over 5 years ago

Although #4598 is not yet done, I've used a dev copy to establish that the most recent job which failed due to this error was qr1hi-8i9sb-cgq4srhv5md7ceu on November 18.
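
The kind of check described above (finding the most recent job that hit this error) could be sketched roughly as follows. This is not the tool from #4598; it assumes job logs are already available as plain text lines keyed by job UUID, and `most_recent_failure`, `parse_timestamp` and `logs_by_job` are hypothetical names introduced for illustration.

```python
from datetime import datetime

# Error text exactly as it appears in the logs quoted in this report.
ERROR_TEXT = "srun: error: Application launch failed: Communication connection failure"


def parse_timestamp(line):
    """Parse the leading 'Fri Nov 7 22:37:48 2014' style timestamp of a log line."""
    stamp = " ".join(line.split()[:5])
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")


def most_recent_failure(logs_by_job):
    """Return (job_uuid, timestamp) for the latest occurrence of ERROR_TEXT.

    logs_by_job is an assumed structure: a dict mapping job UUID to a list
    of log lines. Returns None if no job log contains the error.
    """
    latest = None
    for job_uuid, lines in logs_by_job.items():
        for line in lines:
            if ERROR_TEXT in line:
                when = parse_timestamp(line)
                if latest is None or when > latest[1]:
                    latest = (job_uuid, when)
    return latest


# Sample data built only from log lines quoted earlier in this report.
logs_by_job = {
    "qr1hi-8i9sb-ue3o9q5pi2r0eg7": [
        "Fri Nov 7 22:37:41 2014 srun: error: Application launch failed: Communication connection failure",
        "Fri Nov 7 22:37:48 2014 srun: error: Application launch failed: Communication connection failure",
    ],
    "qr1hi-8i9sb-y7pdmoklllw6xog": [
        "Fri Nov 7 22:58:30 2014 srun: error: Application launch failed: Communication connection failure",
        "Fri Nov 7 22:58:37 2014 srun: error: Application launch failed: Communication connection failure",
    ],
}

result = most_recent_failure(logs_by_job)
print(result)
```

The timestamp parsing relies on English day and month abbreviations, so it assumes the script runs under a C or English locale.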

#6 Updated by Tom Clegg about 5 years ago

  • Status changed from Feedback to Resolved

#7 Updated by Ward Vandewege about 5 years ago

  • Target version changed from Bug Triage to 2015-01-28 Sprint

#8 Updated by Ward Vandewege about 5 years ago

  • Story points set to 0.5
