Bug #5562
[SDKs] Fix timeout problems by switching from urllib et al. to PycURL
Status: Closed
Description
When many Keep clients on one node are sending or receiving large amounts of data to and from the same Keep server on another node, the network link can become saturated and some connections get starved.
The symptom is that some of the clients have their writes delayed, which results in connections timing out (an "Unexpected EOF" error on the server).
Setting a connection limit that rejects excess connections doesn't work: it's easy for a process to get 4 "Connection rejected" errors in a row and give up.
The underlying problem is that urllib3 applies the "connection timeout" to all socket operations until the entire request has been sent, and only then switches the socket over to the "read timeout". This means that if any single send() operation blocks for longer than the connection timeout, the entire request is aborted.
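For reference, a minimal sketch of how a client might configure urllib3 today (the pool setup, URL, and timeout values are illustrative, not the Keep client's actual code):

    import urllib3

    # urllib3 only distinguishes a connect timeout from a read timeout;
    # there is no separate per-request timeout covering the body upload.
    pool = urllib3.PoolManager(timeout=urllib3.Timeout(connect=2.0, read=60.0))

    # During a large PUT, each send() of the body is still bounded by the
    # 2-second connect timeout; one stalled send() aborts the whole request.
    resp = pool.request('PUT', 'http://keep.example:25107/some-locator',  # hypothetical URL
                        body=b'\0' * (64 << 20))
    print(resp.status)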
We could patch urllib3 to add a separate "request timeout" to be set in connection.HTTPConnection._new_conn() after the actual TCP socket is set up. Pushing an upstream patch and waiting for a new release (or maintaining a forked version) could be awkward, though.
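A rough sketch of that patch as a subclass, assuming urllib3's HTTPConnection API; the request_timeout attribute and its value are hypothetical, not an existing urllib3 option:

    import urllib3.connection

    class RequestTimeoutHTTPConnection(urllib3.connection.HTTPConnection):
        request_timeout = 90  # seconds; illustrative value

        def _new_conn(self):
            # The parent creates the TCP socket using the connect timeout.
            sock = super()._new_conn()
            # Relax the socket timeout so sends during the body upload are
            # bounded by the request timeout, not the connect timeout.
            sock.settimeout(self.request_timeout)
            return sock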
Alternatively, we could switch to PycURL (a Python wrapper for libcurl).
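A sketch of what pycurl offers here (URL and thresholds are assumptions): libcurl's connect timeout applies only to TCP setup, and the "low speed" options detect a stalled transfer as a whole rather than timing out any single socket operation.

    import io
    import pycurl

    body = io.BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, 'http://keep.example:25107/some-locator')  # hypothetical URL
    curl.setopt(pycurl.CONNECTTIMEOUT, 5)       # TCP connection setup only
    curl.setopt(pycurl.LOW_SPEED_LIMIT, 1024)   # abort if slower than 1 KiB/s...
    curl.setopt(pycurl.LOW_SPEED_TIME, 60)      # ...for 60 consecutive seconds
    curl.setopt(pycurl.WRITEFUNCTION, body.write)
    curl.perform()
    print(curl.getinfo(pycurl.RESPONSE_CODE), len(body.getvalue()))
    curl.close()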
Another related solution for writes is to use "Expect: 100-continue" handling to limit the number of simultaneous requests that are transmitting blocks.
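A sketch of that idea with pycurl (URL and block size are illustrative): the client advertises "Expect: 100-continue" and waits for the server's interim "100 Continue" response before transmitting the block, which lets a busy server pace how many uploads are in flight at once.

    import io
    import pycurl

    data = b'\0' * (64 << 20)  # illustrative 64 MiB block
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, 'http://keep.example:25107/some-locator')  # hypothetical URL
    curl.setopt(pycurl.UPLOAD, 1)
    curl.setopt(pycurl.READFUNCTION, io.BytesIO(data).read)
    curl.setopt(pycurl.INFILESIZE, len(data))
    curl.setopt(pycurl.HTTPHEADER, ['Expect: 100-continue'])
    curl.perform()
    print(curl.getinfo(pycurl.RESPONSE_CODE))
    curl.close()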