Bug #4124

[Crunch] Socket timed out on send/recv operation causes pipeline failure

Added by Sarah Guthrie almost 3 years ago. Updated 3 months ago.

Status:ResolvedStart date:10/07/2014
Priority:NormalDue date:
Assignee:Peter Amstutz% Done:

100%

Category:Crunch
Target version:2017-04-26 sprint
Story points-Remaining (hours)0.00 hour
Velocity based estimate-

Description

This pipeline uses one_task_per_input_file. There are over 800 input files. Each task takes about 20 minutes to run. Soon after one of the tasks finish successfully, this error occurs:

...
Tue Oct 7 17:02:42 2014 9tee4-8i9sb-blb48qaou8uatsi 8876 21 stderr srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
Tue Oct 7 17:02:42 2014 9tee4-8i9sb-blb48qaou8uatsi 8876 21 stderr srun: error: Task launch for 98.24 failed on node compute0: Socket timed out on send/recv operation
Tue Oct 7 17:02:42 2014 9tee4-8i9sb-blb48qaou8uatsi 8876 21 stderr srun: error: Application launch failed: Socket timed out on send/recv operation

Seen on 9tee4 on pipeline instance: https://workbench.9tee4.arvadosapi.com/pipeline_instances/9tee4-d1hrv-5zc2cqp2yzhcmd8

@40000000570d24e3389fa8ac.s (121 KB) Nico César, 03/29/2017 09:17 pm


Subtasks

Task #11485: ReviewClosed

Task #11483: ReviewResolvedTom Clegg

History

#1 Updated by Ward Vandewege almost 3 years ago

  • Target version set to Bug Triage

#2 Updated by Radhika Chippada almost 3 years ago

  • Subject changed from Socket timed out on send/recv operation causes pipeline failure to [Crunch] Socket timed out on send/recv operation causes pipeline failure
  • Category set to Crunch

#3 Updated by Sarah Guthrie almost 3 years ago

The error occurred again on the same type of job described in the initial bug report after 2hr12m

This error did not occur immediately after the first task completion; several other tasks were able to complete successfully.

Pipeline instance:
https://workbench.9tee4.arvadosapi.com/pipeline_instances/9tee4-d1hrv-4hcgjqub7kirf18

Message:
Thu Oct 9 18:01:31 2014 9tee4-8i9sb-esynpi0n4s6sad2 19227 66 stderr srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
Thu Oct 9 18:01:31 2014 9tee4-8i9sb-esynpi0n4s6sad2 19227 66 stderr srun: error: Task launch for 10275.69 failed on node compute0: Socket timed out on send/recv operation
Thu Oct 9 18:01:31 2014 9tee4-8i9sb-esynpi0n4s6sad2 19227 66 stderr srun: error: Application launch failed: Socket timed out on send/recv operation
Thu Oct 9 18:01:31 2014 9tee4-8i9sb-esynpi0n4s6sad2 19227 66 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.

#4 Updated by Brett Smith almost 3 years ago

  • Priority changed from Normal to High
  • Parent task set to #4133

Ward thinks this might be related to some driver warning messages we've been seeing on the 9tee4 nodes.

#5 Updated by Brett Smith almost 3 years ago

  • Status changed from New to In Progress
  • Assignee set to Brett Smith

#6 Updated by Brett Smith almost 3 years ago

Both jobs followed the same pattern:

  • One of the parallel tasks completes. (Note that when Sally said "first task" above, she meant first "real" task, not task 0. If you count task 0, the first job saw two tasks finish.)
  • Crunch promptly dispatches the next task with srun.
  • Ten seconds pass.
  • The new task gets marked as complete in SLURM, and then something immediately tries to cancel it. This fails, since the task no longer exists when the cancel attempts come in.
  • The running tasks are also canceled.

Here's the SLURM scheduler log from the job on the 7th. 98.15 is the task that finished. 98.24 is the task crunch-job starts to take its place, and then fails.

2014-10-07T17:02:32+00:00 sched: _slurm_rpc_job_step_create: StepId=98.24 compute0 usec=468
2014-10-07T17:02:32+00:00 sched: _slurm_rpc_step_complete StepId=98.15 usec=25
2014-10-07T17:02:42+00:00 sched: _slurm_rpc_step_complete StepId=98.24 usec=32
2014-10-07T17:02:42+00:00 job_step_cancel step 98.24 not found
2014-10-07T17:02:42+00:00 Signal 9 of StepId=98.24 by UID=4005: Invalid job id specified
2014-10-07T17:02:43+00:00 job_step_cancel step 98.24 not found
2014-10-07T17:02:43+00:00 Signal 9 of StepId=98.24 by UID=4005: Invalid job id specified
2014-10-07T17:02:52+00:00 step_partial_comp: StepID=98.24 invalid
2014-10-07T17:02:52+00:00 sched: Cancel of StepId=98.4 by UID=4005 usec=222
2014-10-07T17:02:53+00:00 sched: Cancel of StepId=98.11 by UID=4005 usec=227

(it goes on like this for a while)

When you search for these error messages, the general discussion revolves around the scheduler being overloaded and unable to respond promptly. That's clearly not the case here: there's ten seconds of nothing happening on the scheduler between the time the task is started and the time it's aborted.

Right now my feeling is that Ward's packet loss theory has merit, and we're looking at a situation where the scheduler handled the new task no problem, but the compute node never got the ack and so gave up. Still investigating.

#7 Updated by Brett Smith almost 3 years ago

One quick hack workaround might be to increase SLURM's MessageTimeout:

MessageTimeout: Time permitted for a round-trip communication to complete in seconds. Default value is 10 seconds. For systems with shared nodes, the slurmd daemon could be paged out and necessitate higher values.

… but this will only be effective if the compute node asks for acks to be retransmitted. If it doesn't, then we're just deciding the difference between a 10-second wait and a different wait.

#8 Updated by Brett Smith almost 3 years ago

Capturing office conversation: Ward turned off scatter-gather on the network drivers of the relevant hosts, which should help avoid the issue being reported in the kernel logs. Sally is going to try running her pipeline again. We'll see in the morning how it went, and reassess from there.

(Later:) The network driver error messages are gone from the API server's logs, but the job still failed the same way. Back to the drawing board.

#9 Updated by Brett Smith almost 3 years ago

Research progress: we had an idea that this might be something specific to Sally's Crunch script, but it's not. It's also hitting run-command jobs. See 9tee4-8i9sb-mhess4yafb3a1rn and 9tee4-8i9sb-8a6s1wltx3u3jyo.

This appeared in the compute node's daemon.log for the failed task on the 7th:

Oct  7 17:02:52 compute0 slurmd[2737]: launch task 98.24 request from 4005.4005@[9tee4.arvadosapi.com] (port 52170)
Oct  7 17:02:53 compute0 slurmd[2737]: error: _rpc_launch_tasks: unable to send return code to address:port=[9tee4.arvadosapi.com]:51915 msg_type=6001: Transport endpoint is not connected

If you compare against the timestamps in the earlier log, you'll see that 17:02:52 is exactly ten seconds after the scheduler received the request. Because SLURM's MessageTimeout is 10 seconds, it looks like by the time the compute node actually received the task for processing, the scheduler was giving up on it.

The failure on the 9th is even more illustrative:

Oct  9 18:01:36 compute0 slurmd[2737]: launch task 10275.69 request from 4005.4005@[9tee4.arvadosapi.com] (port 928)
Oct  9 18:02:06 compute0 slurmd[2737]: Warning: slurmstepd startup took 14 sec, possible file system problem or full memory
Oct  9 18:02:06 compute0 slurmd[2737]: error: _rpc_launch_tasks: unable to send return code to address:port=[9tee4.arvadosapi.com]:40963 msg_type=6001: Transport endpoint is not connected

The scheduler received the new task request at 18:01:21, so this shows a 15-second gap before that request makes it to the compute node.

Because we haven't seen this on the cloud clusters, this appears to be some kind of 9tee4 configuration issue. At least one of two explanations seems likely:

  • The compute node is so starved for computing resources (most likely memory) that it when it sends the new work to SLURM, it can't actually start that work within the ten-second window.
  • Some kind of network issue is causing packet loss, which causes enough retry behavior to cause the request loop to take longer than the allotted ten seconds.

We have a basic ping test running to check the second theory, and we're working on improving diagnostics for the first.

All this research does suggest that increasing SLURM's MessageTimeout could effectively patch over the situation. But it seems better to resolve the underlying performance issue, since that could potentially affect the compute work itself.

#10 Updated by Brett Smith almost 3 years ago

Brett Smith wrote:

We have a basic ping test running to check the second theory, and we're working on improving diagnostics for the first.

A ping from the 9tee4 API server to compute0 did not drop a single packet all weekend. I don't consider this very conclusive, since I doubt the network stack was under any meaningful load during this time, but there it is.

#11 Updated by Brett Smith almost 3 years ago

I ran the pipeline alongside some system diagnostics to get a better idea of what the bottleneck here is. During the run, no pings were dropped, but up to half of swap was utilized. During this time, my SSH session was very sluggish, in the usual way that heavily swapped systems are. It looks like, for whatever reason, swap on 9tee4 is not as performant as the cloud clusters, and this prevents SLURM from acknowledging new work in time for the SLURM controller.

#12 Updated by Brett Smith almost 3 years ago

  • Assignee changed from Brett Smith to Ward Vandewege

Kicking this over to Ward. He thinks there are ways we might improve swap performance at the hypervisor layer, and that's our next step.

#13 Updated by Brett Smith almost 3 years ago

  • Parent task changed from #4133 to #4134

#14 Updated by Brett Smith over 2 years ago

  • Parent task deleted (#4134)

#15 Updated by Brett Smith over 2 years ago

  • Target version changed from 2014-10-29 sprint to Bug Triage

#16 Updated by Joshua Randall over 1 year ago

I think I ran into this issue today on one of our compute nodes.

Some job log:

2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 23 stderr srun: error: Task launch for 34104.29 failed on node humgen-05-09: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 23 stderr srun: error: Application launch failed: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 23 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 24 stderr srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 24 stderr srun: error: Task launch for 34104.30 failed on node humgen-05-09: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 24 stderr srun: error: Application launch failed: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 24 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 25 stderr srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 25 stderr srun: error: Task launch for 34104.31 failed on node humgen-05-09: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 25 stderr srun: error: Application launch failed: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 25 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240  status: 1 done, 32 running, 168 todo

I tried restarting slurm on all nodes (including the master) but it continued to recur.

#17 Updated by Joshua Randall over 1 year ago

This continues to happen to us:

2016-03-31_13:40:57 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 stderr srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
2016-03-31_13:40:57 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 stderr srun: error: Task launch for 92.35 failed on node humgen-05-09: Socket timed out on send/recv operation
2016-03-31_13:40:57 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 stderr srun: error: Application launch failed: Socket timed out on send/recv operation
2016-03-31_13:40:57 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-31_13:40:57 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 stderr srun: error: Timed out waiting for job step to complete
2016-03-31_13:40:57 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 child 7669 on humgen-05-09.29 exit 1 success=
2016-03-31_13:40:57 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 ERROR: Task process exited 1, but never updated its task record to indicate success and record its output.
2016-03-31_13:41:00 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 failure (#1, permanent) after 12 seconds
2016-03-31_13:41:00 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 task output (0 bytes): 

The likely reason is that the network between the compute node and the master node is congested (this happens to us alongside several other failures that are related to network congestion causing timeouts - on DNS and LDAP lookups, for example). The problem in our case is that we are completely saturating the 20Gbps link between a blade chassis of compute-only nodes and the switch to which the keep nodes are attached. We have confirmed that the network is saturated by monitoring the switch, so this seems very likely to be the underlying "cause" for us - and it may explain the earlier reported errors as well.

Not sure if there is any way to tune the SLURM network settings to address this...

#18 Updated by Brett Smith over 1 year ago

  • Tracker changed from Task to Bug

#19 Updated by Joshua Randall over 1 year ago

A similar error reported to SLURM in 2013 turned out to be related to LDAP failures: https://bugs.schedmd.com/show_bug.cgi?id=471#c24

So it is possible that the actual error is not with the SLURM rpc having issues with the network congestion - it could (in our system) be related to our LDAP lookups having too low of a timeout when the network is congested.

#20 Updated by Joshua Randall over 1 year ago

Have now added crunch user to local passwd file - will report if this recurs.

getent passwd crunch >> /etc/passwd

#21 Updated by Joshua Randall over 1 year ago

Just had another failure after adding crunch user to passwd file, so I guess this may be a SLURM RPC failure (under heavy network congestion) after all.

#22 Updated by Joshua Randall over 1 year ago

Looks like we are using the default MessageTimeout of 10s, so it is likely that is the problem. Will try raising that to 60s.

#23 Updated by Joshua Randall over 1 year ago

Have not seen this issue recur since we raised the SLURM MessageiImeout to 60s. Suggest adding that to the recommended SLURM configuration.

#24 Updated by Tom Morris 4 months ago

  • Assignee changed from Ward Vandewege to Peter Amstutz
  • Target version changed from Bug Triage to 2017-03-29 sprint

Peter - please review whether the recommended message timeout has already been incorporated and close this out. Thx

#25 Updated by Peter Amstutz 4 months ago

I don't know if we ever updated our default puppet configuration.

We've seen some unexplained SLURM timeouts on some clusters, so this might still be worth doing.

#26 Updated by Peter Amstutz 4 months ago

  • Target version changed from 2017-03-29 sprint to 2017-04-12 sprint

#27 Updated by Tom Morris 4 months ago

  • Assignee changed from Peter Amstutz to Nico César

Review production cluster logs to see if this specific timeout message has been seen. If not, close this. Otherwise, update our config and add a recommendation for customer configs.

#28 Updated by Nico César 4 months ago

well i found this from 2016-04-12 :

manage.su92l:/home/nico# zgrep 'Socket timed out on send/recv operation' /etc/sv/slurmctld/log/main/@* -C3
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:21.18098 slurmctld: Resending TERMINATE_JOB request JobId=8938 Nodelist=compute129
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:21.18213 slurmctld: Resending TERMINATE_JOB request JobId=8983 Nodelist=compute14
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:21.18305 slurmctld: Resending TERMINATE_JOB request JobId=8987 Nodelist=compute33
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:21.87992 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:21.87994 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:21.87995 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:21.87995 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:21.87996 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:21.87997 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:22.83068 slurmctld: sched: _slurm_rpc_job_step_create: StepId=8986.54 compute88 usec=11058638
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:22.93038 slurmctld: step_partial_comp: StepID=8986.6 invalid
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:23.67955 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:23.67958 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:23.92964 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:23.92977 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:25.07971 slurmctld: _slurm_rpc_allocate_resources: Requested nodes are busy
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:25.13000 slurmctld: completing job 9015
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:26:26.77971 slurmctld: sched: job_complete for JobId=9015 successful, exit code=0
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:--
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:12.70037 slurmctld: Resending TERMINATE_JOB request JobId=8986 Nodelist=compute88
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:12.70074 slurmctld: Resending TERMINATE_JOB request JobId=8987 Nodelist=compute33
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:12.70389 slurmctld: step_partial_comp: StepID=8932.8 invalid
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:22.92982 slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:39.52981 slurmctld: step_partial_comp: StepID=8796.8 invalid
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:39.68054 slurmctld: agent msg_type=6011 ran for 27 seconds
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:39.68055 slurmctld: agent msg_type=6011 ran for 27 seconds
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:--
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.51897 slurmctld: sched: _slurm_rpc_allocate_resources JobId=9032 NodeList=compute109 usec=3224
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.54139 slurmctld: sched: _slurm_rpc_job_step_create: StepId=9031.1 compute14 usec=2296
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.62397 slurmctld: sched: _slurm_rpc_step_complete StepId=9030.1 usec=105
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.63988 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.65004 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.72711 slurmctld: sched: _slurm_rpc_step_complete StepId=9029.5 usec=119
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.76601 slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.76602 slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.83112 slurmctld: sched: _slurm_rpc_job_step_create: StepId=9029.6 compute105 usec=2357
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.83426 slurmctld: sched: _slurm_rpc_job_step_create: StepId=9030.2 compute63 usec=2344
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.83589 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.84600 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.87296 slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.91593 slurmctld: sched: _slurm_rpc_job_step_create: StepId=9032.0 compute109 usec=1999
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.95948 slurmctld: Job 8680 completion process took 1763 seconds
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:51.95957 slurmctld: sched: _slurm_rpc_step_complete StepId=9030.2 usec=19308
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:--
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:52.78103 slurmctld: sched: _slurm_rpc_job_step_create: StepId=9029.7 compute105 usec=3032
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:52.82243 slurmctld: sched: _slurm_rpc_step_complete StepId=9031.2 usec=10360
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:52.82895 slurmctld: sched: _slurm_rpc_job_step_create: StepId=9030.5 compute63 usec=1849
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:52.85327 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:52.86342 slurmctld: error: slurm_receive_msg: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:52.86788 slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:52.93890 slurmctld: sched: _slurm_rpc_job_step_create: StepId=9031.3 compute14 usec=2014
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:52.98394 slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:52.98396 slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:52.98397 slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:52.98398 slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:52.98398 slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:52.98399 slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:53.02247 slurmctld: sched: _slurm_rpc_step_complete StepId=9031.3 usec=172
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:53.04483 slurmctld: sched: _slurm_rpc_job_step_create: StepId=9032.1 compute109 usec=3545
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_04:36:53.06526 slurmctld: sched: _slurm_rpc_step_complete StepId=9029.7 usec=138
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:--
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_16:38:38.07252 slurmctld: agent msg_type=1003 ran for 34 seconds
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_16:38:38.65630 slurmctld: sched: _slurm_rpc_allocate_resources JobId=9401 NodeList=compute40 usec=870
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_16:38:39.23506 slurmctld: sched: _slurm_rpc_job_step_create: StepId=9401.0 compute40 usec=1080
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_16:38:48.31479 slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_16:38:48.57930 slurmctld: Node compute167 now responding
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_16:38:48.57932 slurmctld: agent msg_type=1003 ran for 32 seconds
/etc/sv/slurmctld/log/main/@40000000570d24e3389fa8ac.s:2016-04-12_16:38:49.15537 slurmctld: agent msg_type=1003 ran for 35 seconds

#29 Updated by Peter Amstutz 3 months ago

  • Target version changed from 2017-04-12 sprint to 2017-04-26 sprint

#30 Updated by Tom Morris 3 months ago

  • Status changed from In Progress to Resolved

#31 Updated by Tom Morris 3 months ago

The instances found in the production logs are a year old, so marking this as obsolete/closed.

Also available in: Atom PDF