Bug #4124

[Crunch] Socket timed out on send/recv operation causes pipeline failure

Added by Sarah Guthrie over 2 years ago. Updated 13 days ago.

Status: In Progress
Priority: High
Assignee: Peter Amstutz
Category: Crunch
Target version: 2017-03-29 sprint
Start date: 10/07/2014
Due date:
% Done: 0%
Story points: -
Velocity based estimate: -

Description

This pipeline uses one_task_per_input_file. There are over 800 input files. Each task takes about 20 minutes to run. Soon after one of the tasks finishes successfully, this error occurs:

...
Tue Oct 7 17:02:42 2014 9tee4-8i9sb-blb48qaou8uatsi 8876 21 stderr srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
Tue Oct 7 17:02:42 2014 9tee4-8i9sb-blb48qaou8uatsi 8876 21 stderr srun: error: Task launch for 98.24 failed on node compute0: Socket timed out on send/recv operation
Tue Oct 7 17:02:42 2014 9tee4-8i9sb-blb48qaou8uatsi 8876 21 stderr srun: error: Application launch failed: Socket timed out on send/recv operation

Seen on 9tee4 on pipeline instance: https://workbench.9tee4.arvadosapi.com/pipeline_instances/9tee4-d1hrv-5zc2cqp2yzhcmd8
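
For context, a crunch script using one_task_per_input_file generally follows the pattern below. This is a hypothetical minimal sketch assuming the standard Python SDK helpers, not the actual pipeline script; the md5 step stands in for the real ~20-minute-per-file work.

#!/usr/bin/env python
# Hypothetical sketch of the one_task_per_input_file pattern (not the real script).
import hashlib
import os
import arvados

# Task 0 queues one new task per file in the 'input' collection, then ends.
arvados.job_setup.one_task_per_input_file(if_sequence=0, and_end_task=True,
                                          input_as_path=True)

# Each queued task processes just the one file it was handed.
this_task = arvados.current_task()
input_file = arvados.get_task_param_mount('input')

digestor = hashlib.new('md5')
with open(input_file, 'rb') as f:
    while True:
        buf = f.read(2 ** 20)
        if not buf:
            break
        digestor.update(buf)

# Record the result as this task's output collection.
out = arvados.CollectionWriter()
out.set_current_file_name(os.path.basename(input_file) + '.md5')
out.write("%s  %s\n" % (digestor.hexdigest(), os.path.basename(input_file)))
this_task.set_output(out.finish())

Each task completion is what prompts crunch-job to srun the next queued task; that dispatch is where the timeouts above appear.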

History

#1 Updated by Ward Vandewege over 2 years ago

  • Target version set to Bug Triage

#2 Updated by Radhika Chippada over 2 years ago

  • Subject changed from Socket timed out on send/recv operation causes pipeline failure to [Crunch] Socket timed out on send/recv operation causes pipeline failure
  • Category set to Crunch

#3 Updated by Sarah Guthrie over 2 years ago

The error occurred again, after 2hr12m, on the same type of job described in the initial bug report.

This error did not occur immediately after the first task completion; several other tasks were able to complete successfully.

Pipeline instance:
https://workbench.9tee4.arvadosapi.com/pipeline_instances/9tee4-d1hrv-4hcgjqub7kirf18

Message:
Thu Oct 9 18:01:31 2014 9tee4-8i9sb-esynpi0n4s6sad2 19227 66 stderr srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
Thu Oct 9 18:01:31 2014 9tee4-8i9sb-esynpi0n4s6sad2 19227 66 stderr srun: error: Task launch for 10275.69 failed on node compute0: Socket timed out on send/recv operation
Thu Oct 9 18:01:31 2014 9tee4-8i9sb-esynpi0n4s6sad2 19227 66 stderr srun: error: Application launch failed: Socket timed out on send/recv operation
Thu Oct 9 18:01:31 2014 9tee4-8i9sb-esynpi0n4s6sad2 19227 66 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.

#4 Updated by Brett Smith over 2 years ago

  • Priority changed from Normal to High
  • Parent task set to #4133

Ward thinks this might be related to some driver warning messages we've been seeing on the 9tee4 nodes.

#5 Updated by Brett Smith over 2 years ago

  • Status changed from New to In Progress
  • Assignee set to Brett Smith

#6 Updated by Brett Smith over 2 years ago

Both jobs followed the same pattern:

  • One of the parallel tasks completes. (Note that when Sally said "first task" above, she meant first "real" task, not task 0. If you count task 0, the first job saw two tasks finish.)
  • Crunch promptly dispatches the next task with srun.
  • Ten seconds pass.
  • The new task gets marked as complete in SLURM, and then something immediately tries to cancel it. This fails, since the task no longer exists when the cancel attempts come in.
  • The running tasks are also canceled.

Here's the SLURM scheduler log from the job on the 7th. 98.15 is the task that finished; 98.24 is the task crunch-job starts in its place, and the one that then fails.

2014-10-07T17:02:32+00:00 sched: _slurm_rpc_job_step_create: StepId=98.24 compute0 usec=468
2014-10-07T17:02:32+00:00 sched: _slurm_rpc_step_complete StepId=98.15 usec=25
2014-10-07T17:02:42+00:00 sched: _slurm_rpc_step_complete StepId=98.24 usec=32
2014-10-07T17:02:42+00:00 job_step_cancel step 98.24 not found
2014-10-07T17:02:42+00:00 Signal 9 of StepId=98.24 by UID=4005: Invalid job id specified
2014-10-07T17:02:43+00:00 job_step_cancel step 98.24 not found
2014-10-07T17:02:43+00:00 Signal 9 of StepId=98.24 by UID=4005: Invalid job id specified
2014-10-07T17:02:52+00:00 step_partial_comp: StepID=98.24 invalid
2014-10-07T17:02:52+00:00 sched: Cancel of StepId=98.4 by UID=4005 usec=222
2014-10-07T17:02:53+00:00 sched: Cancel of StepId=98.11 by UID=4005 usec=227

(it goes on like this for a while)

When you search for these error messages, the general discussion revolves around the scheduler being overloaded and unable to respond promptly. That's clearly not the case here: there's ten seconds of nothing happening on the scheduler between the time the task is started and the time it's aborted.

Right now my feeling is that Ward's packet loss theory has merit: we're looking at a situation where the scheduler handled the new task without a problem, but the compute node never got the ack and so gave up. Still investigating.

#7 Updated by Brett Smith over 2 years ago

One quick hack workaround might be to increase SLURM's MessageTimeout:

MessageTimeout: Time permitted for a round-trip communication to complete in seconds. Default value is 10 seconds. For systems with shared nodes, the slurmd daemon could be paged out and necessitate higher values.

… but this will only be effective if the compute node asks for the acks to be retransmitted. If it doesn't, all we're changing is how long we wait before the same failure.

#8 Updated by Brett Smith over 2 years ago

Capturing office conversation: Ward turned off scatter-gather on the network drivers of the relevant hosts, which should stop the driver errors being reported in the kernel logs. Sally is going to try running her pipeline again. We'll see in the morning how it went and reassess from there.

(Later:) The network driver error messages are gone from the API server's logs, but the job still failed the same way. Back to the drawing board.

#9 Updated by Brett Smith over 2 years ago

Research progress: we had an idea that this might be something specific to Sally's Crunch script, but it's not. It's also hitting run-command jobs. See 9tee4-8i9sb-mhess4yafb3a1rn and 9tee4-8i9sb-8a6s1wltx3u3jyo.

This appeared in the compute node's daemon.log for the failed task on the 7th:

Oct  7 17:02:52 compute0 slurmd[2737]: launch task 98.24 request from 4005.4005@[9tee4.arvadosapi.com] (port 52170)
Oct  7 17:02:53 compute0 slurmd[2737]: error: _rpc_launch_tasks: unable to send return code to address:port=[9tee4.arvadosapi.com]:51915 msg_type=6001: Transport endpoint is not connected

If you compare against the timestamps in the earlier log, you'll see that 17:02:52 is exactly ten seconds after the scheduler received the request. Because SLURM's MessageTimeout is 10 seconds, it looks like the scheduler had already given up on the task by the time the compute node actually received it for processing.

The failure on the 9th is even more illustrative:

Oct  9 18:01:36 compute0 slurmd[2737]: launch task 10275.69 request from 4005.4005@[9tee4.arvadosapi.com] (port 928)
Oct  9 18:02:06 compute0 slurmd[2737]: Warning: slurmstepd startup took 14 sec, possible file system problem or full memory
Oct  9 18:02:06 compute0 slurmd[2737]: error: _rpc_launch_tasks: unable to send return code to address:port=[9tee4.arvadosapi.com]:40963 msg_type=6001: Transport endpoint is not connected

The scheduler received the new task request at 18:01:21, so this shows a 15-second gap before that request makes it to the compute node.

Because we haven't seen this on the cloud clusters, this appears to be some kind of 9tee4 configuration issue. At least one of two explanations seems likely:

  • The compute node is so starved for computing resources (most likely memory) that, when SLURM sends it the new work, it can't actually start that work within the ten-second window.
  • Some kind of network issue is causing packet loss, which causes enough retry behavior to cause the request loop to take longer than the allotted ten seconds.

We have a basic ping test running to check the second theory, and we're working on improving diagnostics for the first.

All this research does suggest that increasing SLURM's MessageTimeout could effectively patch over the situation. But it seems better to resolve the underlying performance issue, since that could potentially affect the compute work itself.

#10 Updated by Brett Smith over 2 years ago

Brett Smith wrote:

We have a basic ping test running to check the second theory, and we're working on improving diagnostics for the first.

A ping from the 9tee4 API server to compute0 did not drop a single packet all weekend. I don't consider this very conclusive, since I doubt the network stack was under any meaningful load during this time, but there it is.

#11 Updated by Brett Smith over 2 years ago

I ran the pipeline alongside some system diagnostics to get a better idea of what the bottleneck here is. During the run, no pings were dropped, but up to half of swap was utilized. During this time, my SSH session was very sluggish, in the usual way that heavily swapped systems are. It looks like, for whatever reason, swap on 9tee4 is not as performant as on the cloud clusters, and this prevents SLURM from acknowledging new work in time to satisfy the SLURM controller.

#12 Updated by Brett Smith over 2 years ago

  • Assignee changed from Brett Smith to Ward Vandewege

Kicking this over to Ward. He thinks there are ways we might improve swap performance at the hypervisor layer, and that's our next step.

#13 Updated by Brett Smith over 2 years ago

  • Parent task changed from #4133 to #4134

#14 Updated by Brett Smith over 2 years ago

  • Parent task deleted (#4134)

#15 Updated by Brett Smith over 2 years ago

  • Target version changed from 2014-10-29 sprint to Bug Triage

#16 Updated by Joshua Randall about 1 year ago

I think I ran into this issue today on one of our compute nodes.

Some job log:

2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 23 stderr srun: error: Task launch for 34104.29 failed on node humgen-05-09: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 23 stderr srun: error: Application launch failed: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 23 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 24 stderr srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 24 stderr srun: error: Task launch for 34104.30 failed on node humgen-05-09: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 24 stderr srun: error: Application launch failed: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 24 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 25 stderr srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 25 stderr srun: error: Task launch for 34104.31 failed on node humgen-05-09: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 25 stderr srun: error: Application launch failed: Socket timed out on send/recv operation
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240 25 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-22_14:15:27 z8ta6-8i9sb-7o8f304ibq87urb 31240  status: 1 done, 32 running, 168 todo

I tried restarting slurm on all nodes (including the master), but the problem continued to recur.

#17 Updated by Joshua Randall 12 months ago

This continues to happen to us:

2016-03-31_13:40:57 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 stderr srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
2016-03-31_13:40:57 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 stderr srun: error: Task launch for 92.35 failed on node humgen-05-09: Socket timed out on send/recv operation
2016-03-31_13:40:57 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 stderr srun: error: Application launch failed: Socket timed out on send/recv operation
2016-03-31_13:40:57 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-31_13:40:57 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 stderr srun: error: Timed out waiting for job step to complete
2016-03-31_13:40:57 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 child 7669 on humgen-05-09.29 exit 1 success=
2016-03-31_13:40:57 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 ERROR: Task process exited 1, but never updated its task record to indicate success and record its output.
2016-03-31_13:41:00 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 failure (#1, permanent) after 12 seconds
2016-03-31_13:41:00 z8ta6-8i9sb-kwmi2o9c9lakeip 2775 29 task output (0 bytes): 

The likely reason is that the network between the compute node and the master node is congested (this happens to us alongside several other failures that are related to network congestion causing timeouts - on DNS and LDAP lookups, for example). The problem in our case is that we are completely saturating the 20Gbps link between a blade chassis of compute-only nodes and the switch to which the keep nodes are attached. We have confirmed that the network is saturated by monitoring the switch, so this seems very likely to be the underlying "cause" for us - and it may explain the earlier reported errors as well.

Not sure if there is any way to tune the SLURM network settings to address this...

#18 Updated by Brett Smith 12 months ago

  • Tracker changed from Task to Bug

#19 Updated by Joshua Randall 12 months ago

A similar error reported to SLURM in 2013 turned out to be related to LDAP failures: https://bugs.schedmd.com/show_bug.cgi?id=471#c24

So it is possible that the actual problem is not the SLURM RPC struggling with network congestion; in our system it could be that our LDAP lookups have too short a timeout for a congested network.

#20 Updated by Joshua Randall 12 months ago

Have now added crunch user to local passwd file - will report if this recurs.

getent passwd crunch >> /etc/passwd
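
(This resolves the crunch account through NSS once and caches it in the local passwd file, so subsequent task launches don't depend on a live LDAP lookup.)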

#21 Updated by Joshua Randall 12 months ago

Just had another failure after adding crunch user to passwd file, so I guess this may be a SLURM RPC failure (under heavy network congestion) after all.

#22 Updated by Joshua Randall 12 months ago

Looks like we are using the default MessageTimeout of 10s, so it is likely that is the problem. Will try raising that to 60s.

#23 Updated by Joshua Randall 12 months ago

Have not seen this issue recur since we raised the SLURM MessageTimeout to 60s. Suggest adding that to the recommended SLURM configuration.
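
For reference, the corresponding slurm.conf entry would look something like this (the config file path varies by distribution; 60 seconds is simply the value we settled on):

# In slurm.conf on the controller and all compute nodes (default is 10 seconds)
MessageTimeout=60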

#24 Updated by Tom Morris 15 days ago

  • Assignee changed from Ward Vandewege to Peter Amstutz
  • Target version changed from Bug Triage to 2017-03-29 sprint

Peter - please review whether the recommended message timeout has already been incorporated and close this out. Thx

#25 Updated by Peter Amstutz 13 days ago

I don't know if we ever updated our default puppet configuration.

We've seen some unexplained SLURM timeouts on some clusters, so this might still be worth doing.
