Bug #17989

PySDK socket read timeout on big collection create calls

Added by Lucas Di Pentima 3 months ago. Updated 4 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Target version:
Start date:
09/07/2021
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

While working on #17948 I found an issue with our Python API client while making collection creation request with big payloads:

Traceback (most recent call last):
  File "tools/test-collection-create/test-collection-create.py", line 165, in <module>
    sys.exit(main())
  File "tools/test-collection-create/test-collection-create.py", line 155, in main
    "manifest_text": manifest
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 835, in execute
    method=str(self.method), body=self.body, headers=self.headers)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 179, in _retry_request
    raise exception
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 162, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/arvados/api.py", line 111, in _intercept_http_request
    return self.orig_http_request(uri, method, headers=headers, **kwargs)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1709, in request
    conn, authority, uri, request_uri, method, body, headers, redirections, cachekey,
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1424, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1376, in _conn_request
    response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1352, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 310, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 271, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.7/ssl.py", line 1052, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.7/ssl.py", line 911, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

The collection got created but the client failed to wait for the response (it seems that only waits for around 12 seconds no matter which timeout setting is used).

For more information: https://dev.arvados.org/issues/17948#note-8


Subtasks

Task #18033: Review 17989-pysdk-timeoutResolvedPeter Amstutz


Related issues

Related to Arvados - Bug #17171: [sdk] unreasonable python SDK timeout when ARVADOS_API_HOST is reachableResolved04/15/2021

History

#1 Updated by Lucas Di Pentima 2 months ago

  • Assigned To set to Lucas Di Pentima

#2 Updated by Lucas Di Pentima about 2 months ago

  • Target version changed from 2021-09-01 sprint to 2021-09-15 sprint

#3 Updated by Lucas Di Pentima about 2 months ago

  • Status changed from New to In Progress

#4 Updated by Lucas Di Pentima about 2 months ago

Updates at ca109d685 - branch 17989-pysdk-timeout
Test run: https://ci.arvados.org/job/developer-run-tests/2680/

  • Fixes the arvados.api_from_config() call so that it uses the specified timeout param.
  • Adds test.
  • As a bonus, lowers the max manifest size limit on the test collection creation script to avoid "payload is too large" errors.

#5 Updated by Peter Amstutz about 2 months ago

Lucas Di Pentima wrote:

Updates at ca109d685 - branch 17989-pysdk-timeout
Test run: https://ci.arvados.org/job/developer-run-tests/2680/

  • Fixes the arvados.api_from_config() call so that it uses the specified timeout param.
  • Adds test.
  • As a bonus, lowers the max manifest size limit on the test collection creation script to avoid "payload is too large" errors.

That was a unfortunate oversight, but it looks like it was easy to find and fix.

On the subject of timeouts:

We already have feedback that a default timeout 10s is too short (e.g. arv-mount reportedly cannot load very large collections).

Timeouts are one of those things that don't matter until you start hitting them and you actually do need it to wait longer, then they matter a lot.

So I have a couple of follow up suggestions:

  • Raise the timeout to 20 or 30 seconds (let's see what Tom and Ward think)
  • increase the default timeout of arv-mount, and/or add a command line flag that lets you adjust the timeout and/or add an enviroment variable ARVADOS_TIMEOUT or something that lets you adjust the default timeout.

#6 Updated by Peter Amstutz about 2 months ago

I should have said, the 17989-pysdk-timeout branch LGTM.

#7 Updated by Peter Amstutz about 2 months ago

  • Related to Bug #17171: [sdk] unreasonable python SDK timeout when ARVADOS_API_HOST is reachable added

#8 Updated by Peter Amstutz about 2 months ago

(copied from slack)

I don't suppose we have any better granularity?

Initial TCP connection/TLS handshake timeout and subsequent response timeout really ought to be different things.

This is one of the reasons we use pycurl for keepclient, because it gave us a more knobs to control exactly when to abandon a connection.

#9 Updated by Peter Amstutz about 2 months ago

From discussion: let's revert the timeout to 300 seconds and investigate further if we can separate the connect/handshake timeout from the response timeout.

#10 Updated by Lucas Di Pentima about 2 months ago

Updates at aab565926 - branch 17989-revert-pysdk-default-timeout
Test run: https://ci.arvados.org/job/developer-run-tests/2681/

  • Reverts arvados.api()'s default timeout to 300 seconds.
  • Updates test.

#11 Updated by Peter Amstutz about 2 months ago

Lucas Di Pentima wrote:

Updates at aab565926 - branch 17989-revert-pysdk-default-timeout
Test run: https://ci.arvados.org/job/developer-run-tests/2681/

  • Reverts arvados.api()'s default timeout to 300 seconds.
  • Updates test.

LGTM

#12 Updated by Lucas Di Pentima about 2 months ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|834f7c726f727e04250ed6b3eda287ca996e9c38.

#13 Updated by Peter Amstutz about 1 month ago

  • Release set to 43

#14 Updated by Lucas Di Pentima 4 days ago

  • Release changed from 43 to 42

Also available in: Atom PDF