Bug #6153

SLURM_JOB_ID environment variable expiration persistence on compute node

Added by Bryan Cosca almost 9 years ago. Updated almost 9 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

A new user ran qr1hi-d1hrv-nfqsgixupqykgok and immediately ran into:

2015-05-27_12:02:56 srun: error: slurm_receive_msg: Socket timed out on send/recv operation
2015-05-27_12:02:56 srun: error: Unable to confirm allocation for job 8: Socket timed out on send/recv operation
2015-05-27_12:02:56 srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2015-05-27_12:02:56 qr1hi-8i9sb-lpou0rjjwv9qtq4 1488 Installing Docker image from 256f21bb3abfcd8e08a893886bf3e7c0+5082 exited 1 at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 407
2015-05-27_12:02:56 qr1hi-8i9sb-lpou0rjjwv9qtq4 1488 Freeze not implemented
2015-05-27_12:02:56 qr1hi-8i9sb-lpou0rjjwv9qtq4 1488 collate
2015-05-27_12:02:56 qr1hi-8i9sb-lpou0rjjwv9qtq4 1488 collated output manifest text to send to API server is 0 bytes with access tokens
2015-05-27_12:02:56 Collection saved as 'qr1hi-8i9sb-lpou0rjjwv9qtq4.log.txt'
2015-05-27_12:02:56 qr1hi-8i9sb-lpou0rjjwv9qtq4 1488 log manifest is 11e7f91c2365e8fd70b7fc301dfe3dac+83
2015-05-27_12:02:56 Died at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 1578, <DATA> line 1.
2015-05-27_12:02:56 salloc: Relinquishing job allocation 8
2015-05-27_12:02:56 salloc: error: slurm_receive_msg: Socket timed out on send/recv operation
2015-05-27_12:02:56 salloc: error: Unable to clean up job allocation 8: Socket timed out on send/recv operation
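
The failure signature in the log above (srun and salloc both reporting a socket timeout while talking to the controller) can be picked out mechanically when triaging similar jobs. The following is a minimal sketch, not part of crunch-job or SLURM: the function name `find_slurm_timeouts` and the regular expression are hypothetical, and the line layout is assumed from this excerpt alone.

```python
import re

# Assumed line shape, inferred only from the excerpt above:
#   "<YYYY-MM-DD_HH:MM:SS> <srun|salloc>: error: <message>"
SLURM_ERROR = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}) "
    r"(?P<tool>srun|salloc): error: (?P<msg>.*)$"
)
TIMEOUT_TEXT = "Socket timed out on send/recv operation"


def find_slurm_timeouts(lines):
    """Return (timestamp, tool, message) for each srun/salloc timeout error."""
    hits = []
    for line in lines:
        match = SLURM_ERROR.match(line.strip())
        # Keep only error lines whose message mentions the socket timeout.
        if match and TIMEOUT_TEXT in match.group("msg"):
            hits.append((match.group("ts"), match.group("tool"), match.group("msg")))
    return hits


if __name__ == "__main__":
    sample = [
        "2015-05-27_12:02:56 srun: error: slurm_receive_msg: Socket timed out on send/recv operation",
        "2015-05-27_12:02:56 srun: Check SLURM_JOB_ID environment variable for expired or invalid job.",
        "2015-05-27_12:02:56 salloc: Relinquishing job allocation 8",
        "2015-05-27_12:02:56 salloc: error: Unable to clean up job allocation 8: Socket timed out on send/recv operation",
    ]
    for ts, tool, msg in find_slurm_timeouts(sample):
        print(ts, tool, msg)
```

Matching on the timeout text rather than on the job number keeps the check independent of any particular allocation; lines such as the SLURM_JOB_ID hint are deliberately ignored because they carry no `error:` tag.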

#1

Updated by Brett Smith almost 9 years ago

  • Status changed from New to Closed
  • Target version deleted (Bug Triage)

On June 2 we made changes to our SLURM configuration that should prevent these kinds of communication mishaps between compute nodes and the controller.
