Bug #12684


Let user specify a retry strategy on the client object, used for all API calls

Added by Tom Morris almost 7 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
SDKs
Story points:
2.0
Release relationship:
Auto

Description

Updated Feb 23, 2023

The SDK either doesn't retry at all, or doesn't retry enough. Requiring end users to manually provide num_retries on every call is not a (human) scalable solution.

  • The number of retries should be settable when creating the API object
    • This should be used for requests made while initializing the client object, like fetching the discovery document
  • The default retry count should be much more robust -- like 8 retries

Old ticket

This sounds like what #3147 was intended to address, but it's apparently not working:

Traceback (most recent call last):
  File "./myg_runs.py", line 244, in <module>
    main()
  File "./myg_runs.py", line 230, in main
    dump_subprojects(stats, project, SKIP_PROJECTS)
  File "./myg_runs.py", line 210, in dump_subprojects
    dump_pipeline_instances(stats, sp)
  File "./myg_runs.py", line 182, in dump_pipeline_instances
    time = dump_pipeline_instance(stats, i)
  File "./myg_runs.py", line 167, in dump_pipeline_instance
    dump_jobs(batchid, sample, cwl_runner['job']['components'])
  File "./myg_runs.py", line 84, in dump_jobs
    jobs = api.jobs().list(filters=[['uuid','=',job_uuid]]).execute()
  File "/usr/lib/python2.7/dist-packages/oauth2client/util.py", line 140, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/googleapiclient/http.py", line 840, in execute
    raise HttpError(resp, content, uri=self.uri)
arvados.errors.ApiError: <HttpError 502 when requesting https://e51c5.arvadosapi.com/arvados/v1/jobs?alt=json&filters=%5B%5B%22uuid%22%2C+%22%3D%22%2C+%22e51c5-8i9sb-b8od8nvombxq3h3%22%5D%5D returned "Bad Gateway">

Files

fuse.log (41.8 KB), Brett Smith, 05/08/2023 08:49 PM

Subtasks 1 (0 open, 1 closed)

Task #20490: Review 12684-pysdk-auto-retry (Resolved, Brett Smith, 05/09/2023)

Related issues

Related to Arvados - Bug #3147: [SDKs] Python clients should automatically retry failed API and Keep requests (including timeouts), in order to survive temporary outages like server restarts and network blips. (Resolved, Brett Smith, 08/22/2014)
Related to Arvados - Idea #20107: Research retry strategies when SDK API calls return 5xx errors (Resolved, Brett Smith)
Related to Arvados - Bug #20432: Improve CWL runner handling 503 errors (Resolved, Brett Smith)
Related to Arvados - Idea #7697: [SDKs] Improve socket error handling in PySDK (Resolved, Tom Clegg, 10/29/2015)
Related to Arvados - Bug #7587: [SDKs] Python Google API client raises Broken Pipe socket.error after sitting idle for some time (Resolved, Brett Smith, 10/22/2015)
Related to Arvados - Bug #20425: Inappropriate auto-retry when looking up collection by PDH (Resolved, Tom Clegg, 06/09/2023)
Related to Arvados - Idea #20521: Python tools emit logs from googleapiclient (New)
Related to Arvados - Idea #8539: [SDKs/FUSE] Better retry defaults (Resolved)
Actions #1

Updated by Tom Morris almost 7 years ago

  • Related to Bug #3147: [SDKs] Python clients should automatically retry failed API and Keep requests (including timeouts), in order to survive temporary outages like server restarts and network blips. added
Actions #2

Updated by Lucas Di Pentima almost 7 years ago

It seems that the api client object already has a default retry value of 2 (https://github.com/curoverse/arvados/blob/master/sdk/python/arvados/api.py#L33), and the retry code may be missing some exception catching:

https://github.com/curoverse/arvados/blob/master/sdk/python/arvados/api.py#L69-L101

Actions #3

Updated by Peter Amstutz almost 7 years ago

  • Tracker changed from Feature to Bug
Actions #5

Updated by Tom Morris over 5 years ago

The current SDK defaults are 2 retries with an initial sleep period of 2 seconds and a multiplier of 2, which translates to 3 attempts over 6 seconds (at 0, 2, and 6 seconds).

Although it doesn't look like we're using it, the Google API client library has retry support built in:

https://googleapis.github.io/google-api-python-client/docs/epy/googleapiclient.http-module.html#_retry_request
https://googleapis.github.io/google-api-python-client/docs/epy/googleapiclient.http-pysrc.html#HttpRequest.execute

but their algorithm is different due to its use of randomization with a fixed base period and multiplier:

       sleep_time = rand() * 2 ** retry_num 

The only indication as to whether retries were attempted is a debug level logging message, so I suggest we upgrade that to warning level, like the Google API client library does. Without that there's no way to tell whether the exception came on the final attempt and wasn't intended to be caught or whether it's a retryable exception that's not being caught for some reason while the retries are still in process.
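For comparison, the two backoff schedules described above can be sketched as follows (a minimal illustration; the function and parameter names are assumptions, not the SDKs' actual identifiers):

```python
import random

def fixed_backoff(retries=2, initial=2, multiplier=2):
    """Arvados SDK style: deterministic sleeps between attempts
    (initial, initial * multiplier, ...)."""
    sleeps, delay = [], initial
    for _ in range(retries):
        sleeps.append(delay)
        delay *= multiplier
    return sleeps

def google_backoff(retries=2):
    """googleapiclient style: sleep_time = rand() * 2 ** retry_num,
    i.e. randomized sleeps with a fixed base of 2."""
    return [random.random() * 2 ** n for n in range(1, retries + 1)]

print(fixed_backoff())  # [2, 4] -> attempts at t=0, 2, 6
```

The practical difference is that Google's randomization spreads out retries from many clients (avoiding a thundering herd), while the Arvados schedule is predictable but synchronized.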

Actions #6

Updated by Peter Amstutz over 3 years ago

  • Target version deleted (To Be Groomed)
Actions #7

Updated by Peter Amstutz almost 2 years ago

  • Release set to 60
Actions #8

Updated by Peter Amstutz over 1 year ago

  • Target version set to Future
Actions #9

Updated by Peter Amstutz over 1 year ago

  • Release deleted (60)
  • Assigned To set to Brett Smith
Actions #10

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #11

Updated by Brett Smith over 1 year ago

  • Category set to SDKs
  • Subject changed from [Python SDK] Retry on HTTP 5xx errors to Let user specify a retry strategy on the client object, used for all API calls
Actions #12

Updated by Brett Smith over 1 year ago

This is related to #20107, it would be really nice for us as a team to get on the same page about what our "retry philosophy" is, and then aim to implement that. It would be especially nice if all our SDKs implemented the same strategy.

Actions #13

Updated by Brett Smith over 1 year ago

  • Related to Idea #20107: Research retry strategies when SDK API calls return 5xx errors added
Actions #14

Updated by Brett Smith over 1 year ago

  • Description updated (diff)
Actions #15

Updated by Brett Smith over 1 year ago

I'm gonna start digging into this, but one implementation idea I had: maybe we should implement our own retry logic at the HTTP client layer. Pros:

  • We may want to implement a different retry algorithm than Google does, and this gives us a straightforward way to do that.
  • The Google API client library already supports users providing an arbitrary HTTP client object, so this uses supported interfaces, vs. trying to patch the Google API client itself at runtime when the client is dynamically generated.
  • Retry support for things that aren't API requests, like fetching the discovery document, happens automatically for free.
  • It provides the best backwards compatibility: clients that are already using .execute(num_retries=N) will continue to get the retry logic they already expect. They can switch when they're ready to do that by passing a retries argument to their API client constructor.

The main con is that there are now two layers of retries, and it's not obvious which is which. But if we take care to switch over all of our documentation ASAP, and encourage existing users to switch over, this will probably be a non-problem after we get through a transition period.

Actions #16

Updated by Brett Smith over 1 year ago

  • Related to Bug #20432: Improve CWL runner handling 503 errors added
Actions #17

Updated by Brett Smith over 1 year ago

The patchy solution is to patch googleapiclient.http._retry_request, probably to wrap it with code like:

try:
    num_retries = max(num_retries, http.num_retries)
except AttributeError:
    pass
return _orig_retry_request(http, num_retries, *args, **kwargs)

We then just set num_retries on our http object, like we already do for a lot of other extra flags.

_retry_request has been in the library since 2.1.0, the minimum version we support, to present day. Because the library is in maintenance mode, it seems unlikely to change significantly, so I'm not as worried about the maintenance burden on us.

The main downside of this approach is, assuming we have the Python SDK default to a large-ish num_retries setting, this will change the behavior of existing client code that makes calls like .execute(num_retries=small_n). It's not an API-breaking change, arguably it's an improvement in behavior, and anybody who wants the original behavior can get it by constructing their API client object with num_retries=0. But still worth noting.

On balance I think I'm leaning towards the patchy solution. The downside of "two retry stacks that can retry requests independently" seems worse than the downsides of "patching internal functions (which we already do) and behavior change."

Actions #18

Updated by Brett Smith over 1 year ago

  • Related to Idea #7697: [SDKs] Improve socket error handling in PySDK added
Actions #19

Updated by Brett Smith over 1 year ago

  • Related to deleted (Idea #7697: [SDKs] Improve socket error handling in PySDK)
Actions #20

Updated by Brett Smith over 1 year ago

  • Related to Idea #7697: [SDKs] Improve socket error handling in PySDK added
Actions #21

Updated by Brett Smith over 1 year ago

  • Related to Bug #7587: [SDKs] Python Google API client raises Broken Pipe socket.error after sitting idle for some time added
Actions #22

Updated by Brett Smith over 1 year ago

Brett Smith wrote in #note-15:

I'm gonna start digging into this, but one implementation idea I had: maybe we should implement our own retry logic at the HTTP client layer.

It turns out we kind of already have this, in arvados.api._intercept_http_request. arvados.api.api calls arvados.api._patch_http_request to install it as an instance method on the HTTP object. Note this wrapper does other work besides retrying: it does additional logging, proactively closes sockets it believes are likely to be stale and fail, and extends some exception messages to try to help users debug their own problems.

With this, a typical call stack for an API request is:

googleapiclient.http.execute
googleapiclient.http._retry_request
arvados.api._intercept_http_request
httplib2.Http.request (or something API-compatible)

arvados.api._intercept_http_request retries http.client.HTTPException and socket.error. Note that HTTPException is only raised for malformed HTTP responses, not well-formed 4xx and 5xx error responses. (The function also catches ssl.SSLCertVerificationError, but only to extend the error message, not retry it.)

googleapiclient.http._retry_request retries HTTP 5xx errors; 429; 403, but only when the response indicates it's due to rate limiting; ssl.SSLError; socket.timeout; ConnectionError; OSError/socket.error with an errno matching WSATIMEDOUT, ETIMEDOUT, EPIPE, ECONNABORTED, ECONNREFUSED, or ECONNRESET; and httplib2.ServerNotFoundError.

In general, Google's retry code seems better than ours. It covers more cases, and covers them with more nuance. The only thing we retry that it doesn't look at is http.client.HTTPException. Since I've rarely seen malformed HTTP responses in the wild, I started wondering, what was our original motivation for catching this exception? I followed the trail aaaallllllllllll the way back to:

commit 5722c604c6f5dc1553674d179ec016ec12e2b090
Author: Tom Clegg <tom@clinicalfuture.com>
Date:   Tue Jun 4 13:34:16 2013 -0400

    retry requests after BadStatusLine

Note the comment in that commit:

This is how httplib tells us that it tried to reuse an existing connection but it was already closed by the server. In that case, yes, we would like to retry.

So the original motivation wasn't to actually deal with malformed responses from Arvados API servers. It was to try to patch over a bug in httplib. Which we're generally not using anymore.

After that came 31eb1bdc31e1d030844a6fdc7f4ba4286ec79d4f (#7587), which roughly seems to patch over the httplib2 version of this bug: it catches socket.error instead of anything under httplib.

After that came 8a0eb69984a93852ec888cd3e02b778b0be758ed (#7697), which limited the retry logic to certain kinds of requests and made it retry multiple times with exponential backoff. It switched from catching httplib.BadStatusLine to httplib.HTTPException, apparently just because that seemed like a nice generalization. There's no comment, commit message, or ticket to suggest we saw a problem we actually wanted to address this way. The same commit also added code to proactively close long-idle sockets, which with httplib2 seems like a more robust way to handle the original problem than all the exception handling stuff.

With all that investigation, I am convinced that all this retry logic in _intercept_http_request is basically historical cruft. All the exception handling that's currently in there was trying to do the same thing that the proactive socket closing code does now. And all the retry logic is there basically just to support that.

In my opinion, we should remove all this exception retry logic from _intercept_http_request as redundant, and switch to the patching strategy I proposed in note-17. It'll be net less code for us to maintain; it'll avoid the problem of two retry stacks interacting with each other as they do now; and it means clients will get the benefits of Google's more robust detection of what cases are or aren't retryable.

Actions #23

Updated by Brett Smith over 1 year ago

  • Story points set to 2.0
  • Target version changed from Future to Development 2023-05-10 sprint
  • Status changed from New to In Progress

One point to implement, document, and test. One point to update all our client tools to use it.

Actions #24

Updated by Brett Smith over 1 year ago

The branch is working except that lib/controller/integration_test.go TestContainerInputOnDifferentCluster fails by timing out. I have been experimenting with different retry settings to try to bisect the problem. If I set all the retry settings to 1, the test passes, so I think the code is basically just retrying too much for the test; I don't think this is a situation where one component crashes and causes the stack to hang. Furthermore, if I initialize all the clients with num_retries=1 but then set args.retries = 10, the test goes back to failing. This suggests that the failure is not caused by new retry code, but by increasing the number of retries on an operation that's already retried.

By forcing the FUSE mount to write a debug log, I get the following from it:

2023-05-05 16:53:33 arvados.arvados_fuse[107911] ERROR: arv-mount 8ea82759a04e5ce160fb3b0c7f771b1d+48: error
Traceback (most recent call last):
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/arvados/api.py", line 116, in _intercept_http_request
    response, body = self.orig_http_request(uri, method, headers=headers, **kwargs)
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/httplib2/__init__.py", line 1711, in request
    (response, content) = self._request(
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/httplib2/__init__.py", line 1427, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/httplib2/__init__.py", line 1349, in _conn_request
    conn.connect()
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/httplib2/__init__.py", line 1185, in connect
    raise socket_err
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/httplib2/__init__.py", line 1139, in connect
    sock.connect((self.host, self.port))
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/arvados_fuse/fusedir.py", line 541, in update
    coll_reader = arvados.collection.CollectionReader(
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/arvados/collection.py", line 1927, in __init__
    super(CollectionReader, self).__init__(manifest_locator_or_text, *args, **kwargs)
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/arvados/collection.py", line 1353, in __init__
    self._populate()
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/arvados/collection.py", line 1471, in _populate
    self._populate_from_api_server()
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/arvados/collection.py", line 1454, in _populate_from_api_server
    self._remember_api_response(self._my_api().collections().get(
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/googleapiclient/http.py", line 923, in execute
    resp, content = _retry_request(
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/arvados/api.py", line 87, in _retry_request
    response, body = _orig_retry_request(http, num_retries, *args, **kwargs)
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/googleapiclient/http.py", line 222, in _retry_request
    raise exception
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/googleapiclient/http.py", line 191, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/arvados/api.py", line 132, in _intercept_http_request
    raise type(e)(*e.args)
ConnectionRefusedError: [Errno 111] [req-r8kp6ootctrqdr3umsuc] Connection refused

Full log attached. Note this ran all the integration tests, so that may account for additional logging.

I have not fully dug in; this is my current working theory based on impressions: it looks like this test might bring up "partial" clusters that are running some services but not all. The client may be trying to talk to an API server that is configured in the federation but not actually running. When that fails, it's a retryable error, and now it's trying to retry for much longer than it used to before.

If I'm right about that, the best fix would be to have this federated search in one larger retry loop, where it queries all servers immediately, then retries any retryable errors. We already have a similar strategy for finding Keep blocks on Keep services.

However, that would be another fix added to the branch, and it's not clear we actually need it. If "partial" clusters like this are not actually meant to be supported, it would probably be faster to just have the test spin up more services, so this lookup can get a permanent failure and move on promptly.

Actions #25

Updated by Brett Smith over 1 year ago

Other logs reveal the failing request is https://127.0.0.22:33075/arvados/v1/collections/8ea82759a04e5ce160fb3b0c7f771b1d%2B48?alt=json. With low retries, the ultimate exception is:

2023-05-09 09:15:46 arvados.arvados_fuse[624446] ERROR: arv-mount 8ea82759a04e5ce160fb3b0c7f771b1d+48: error
Traceback (most recent call last):
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/arvados_fuse/fusedir.py", line 541, in update
    coll_reader = arvados.collection.CollectionReader(
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/arvados/collection.py", line 1927, in __init__
    super(CollectionReader, self).__init__(manifest_locator_or_text, *args, **kwargs)
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/arvados/collection.py", line 1353, in __init__
    self._populate()
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/arvados/collection.py", line 1471, in _populate
    self._populate_from_api_server()
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/arvados/collection.py", line 1454, in _populate_from_api_server
    self._remember_api_response(self._my_api().collections().get(
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.9/site-packages/googleapiclient/http.py", line 938, in execute
    raise HttpError(resp, content, uri=self.uri)
arvados.errors.ApiError: <HttpError 502 when requesting https://127.0.0.22:42435/arvados/v1/collections/8ea82759a04e5ce160fb3b0c7f771b1d%2B48?alt=json returned "errors: [cannot use a locally issued token to forward a request to our login cluster (z1111) request failed: https://127.0.0.33:43311/arvados/v1/collections/8ea82759a04e5ce160fb3b0c7f771b1d+48?forwarded_for=z2222-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/8ea82759a04e5ce160fb3b0c7f771b1d+48?forwarded_for=z3333-z2222-: 401 Unauthorized: Not logged in (req-mfmbszjg88528erovp0x)]">

So this all makes sense. FUSE makes an API request that cannot succeed because it is not authorized. Controller turns this into a 502, so FUSE understandably retries it. With additional retries as specified in this ticket, the test now takes longer than expected and times out.

I think controller needs to use a non-retryable status code when all the underlying requests are non-retryable. The bigger question in my mind is, does that need to be done as part of this branch, to allow the new retry logic to work better? Or is it a separate issue? In other words, are we willing to ship increased default retries in the Python SDK given this controller misbehavior?

Actions #26

Updated by Brett Smith over 1 year ago

The test isn't even passing today. In current main it goes down this skip branch:

    if outcoll.UUID == "" {
        arvmountlog, err := fs.ReadFile(arvados.FS(logcfs), "/arv-mount.txt")
        c.Check(err, check.IsNil)
        c.Check(string(arvmountlog), check.Matches, `(?ms).*cannot use a locally issued token to forward a request to our login cluster \(z1111\).*`)
        c.Skip("this use case is not supported yet")
    }

Given that we have tests that document that this does not work, I am less worried about changing the behavior of PySDK. I think it's reasonable to change the test (probably to skip it completely, without even attempting the container).

Actions #27

Updated by Peter Amstutz over 1 year ago

As it happens, we already have a ticket for this: #20425 "Inappropriate auto-retry when looking up collection by PDH"

Actions #28

Updated by Brett Smith over 1 year ago

  • Related to Bug #20425: Inappropriate auto-retry when looking up collection by PDH added
Actions #29

Updated by Brett Smith over 1 year ago

12684-pysdk-auto-retry @ e3962b41ad01780c1b173892f0a5c947449a0ed8 - developer-run-tests: #3648

Note especially the last commit, where mocking services was required to prevent Jenkins from spending 45 minutes retrying requests on a couple of tests. In general, any time our test suite uses mock services or patches out time.sleep, it will hide the potential impact of increasing the default retries. Because of that, I am concerned this branch could have far-reaching impact, creating delays where there were none before in ways that will be considered hangs or performance regressions by our users.

There are a few ways we could mitigate that. One would be testing a real workflow on a dev cluster and comparing runtimes against a previous run. Another would be going through the test suite and adding assertions that no retries occurred in cases where they weren't expected. This is not trivial since different tests use different mock/patch strategies, so the implementation will vary. On the plus side, they would be valuable assertions to have going forward. But it would add substantial work to a branch that has already taken more work than maybe anticipated, so I don't want to invest that time without another dev's agreement.

Actions #30

Updated by Peter Amstutz over 1 year ago

_HTTP_CAN_RETRY = set([408, 409, 422, 423, 500, 502, 503, 504])

I don't think we want to retry the 4xx errors

408, 409 -- I would be very surprised if anything in our stack generates these, safer not to retry.

422 -- this happens when something fails in the process of handling the request; it can happen due to a bug or a deliberately thrown error, but generally we should just assume trying again is unlikely to succeed

423 -- this client isn't talking WebDAV (and we presently don't support WebDAV locking, either); it would also be weird to see this, so it is safer not to retry
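The set under discussion can be read through a small classifier helper (a sketch of the three-way success/retry/fail distinction, not the SDK's actual implementation):

```python
# Quoted from the SDK above; whether to keep the 4xx entries is
# exactly what's being debated in this thread.
_HTTP_CAN_RETRY = {408, 409, 422, 423, 500, 502, 503, 504}

def response_disposition(status_code):
    """Classify a status code: 'success', 'retry', or 'fail' (permanent).
    Dropping the contested 4xx codes from _HTTP_CAN_RETRY would move
    408/409/422/423 from 'retry' to 'fail'."""
    if 200 <= status_code < 300:
        return "success"
    if status_code in _HTTP_CAN_RETRY:
        return "retry"
    return "fail"

print(response_disposition(200))  # success
print(response_disposition(422))  # retry (the behavior questioned above)
print(response_disposition(404))  # fail
```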

Actions #31

Updated by Brett Smith over 1 year ago

Peter Amstutz wrote in #note-30:

I don't think we want to retry the 4xx errors

I'm not super wedded to this but since this code has been around a while let me give the original rationale.

408, 409, and 423 can all be retried according to the HTTP/WebDAV RFCs. The feeling at the time was that if the API server returns one of these inappropriately, that would be an API server bug and should be dealt with there. People developing their own clients might reasonably expect those RFC semantics, after all. When you say it's "safer not to retry," can you sketch out a scenario or two you're worried about? And 423 is included because the retry module aims to be useful for a variety of services, not just the API server.

422 is definitely not in that boat, but at the time it was not uncommon to see 422 responses when the API server was having trouble talking to its database, and we specifically were interested in retrying those requests. See the docstring for arvados.retry.check_http_response_success. I understand and agree it gets returned in a lot of cases where retrying is futile. Unfortunately the fact that it's the generic unhandled exception response code means it's hard to be sure which case we're in, and at the time we preferred to err on the side of retrying. But eight years is a long time. If those database hiccups are less of a concern these days, sure, let's remove it.

Actions #32

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2023-05-10 sprint to Development 2023-05-24 sprint
Actions #33

Updated by Tom Clegg over 1 year ago

my 2¢

408 "server would like to shut down this unused connection" sounds retryable to me

409 "request conflict with the current state of the target resource" seems like it could go either way -- the easiest example to find is "uploading a file that is older than the existing one" which sounds futile to retry, but it also sounds like a response status that might be used for "connect to SSH service on a container that isn't running yet" which sounds very retryable

422 is the default 4xx code used by Rails for "bad request", which is usually not retryable, and MDN specifically puts a warning on this particular status ("The client should not repeat this request without modification"), so I'm inclined to say our client should not.

If Rails returns 422 for a server-side error like "trouble talking to its database" I think we'll get better mileage fixing railsapi than treating 422 as retryable.

Actions #34

Updated by Tom Clegg over 1 year ago

e3962b41ad01780c1b173892f0a5c947449a0ed8

I thought retrying 403 sounded fishy, so I looked into the googleapiclient source and found where it only gets retried if the response body looks like a Google rateLimitExceeded error (as you alluded to above). Maybe worth adding 403 to the test_4xx_not_retried list?

Re retrying 422, see #note-33 above. git grep -w 422 services/api shows many checks for 422 on validation errors. It actually seems a little surprising we've made it this far retrying validation errors. Maybe partly because people don't tend to type execute(num_retries=...) until they get a thing working and then have it fail on a transient error later?

Rest LGTM, thanks for the comments above and helpful commit messages.

I am concerned this branch could have far-reaching impact, creating delays where there were none before in ways that will be considered hangs or performance regressions by our users.

The googleapiclient library has a LOGGER.warning() before sleep-and-retry, which would presumably help with this, but do they actually show up anywhere?

Actions #35

Updated by Brett Smith over 1 year ago

Tom Clegg wrote in #note-34:

I thought retrying 403 sounded fishy, so I looked into the googleapiclient source and found where it only gets retried if the response body looks like a Google rateLimitExceeded error (as you alluded to above). Maybe worth adding 403 to the test_4xx_not_retried list?

Added.

Re retrying 422, see #note-33 above. git grep -w 422 services/api shows many checks for 422 on validation errors. It actually seems a little surprising we've made it this far retrying validation errors. Maybe partly because people don't tend to type execute(num_retries=...) until they get a thing working and then have it fail on a transient error later?

That doesn't happen today because the retry 422 logic isn't hooked up to googleapiclient requests at all. The only code that retries 422 is code that specifically calls arvados.retry.check_http_response_success, which in current main is three functions in arvados that deal with job/task status (which presumably nobody is using anymore) and two in KeepClient (which does return 422 for a hash mismatch but maybe that's not applicable to the requests being retried).

Plus even if we did retry 422 responses, we usually do a small number of retries. So the usual outcome would be your code pauses for a second or two before you get the exception. That would be hard to notice if you weren't looking for it.

Removed all 422 retry logic. Now at 5d0b205148026b44c7f1bd30448c5429d0918d16 - developer-run-tests: #3652

The googleapiclient library has a LOGGER.warning() before sleep-and-retry, which would presumably help with this, but do they actually show up anywhere?

Functionally no, because all of our tools only set up logging for the arvados logging hierarchy. I'm very open to improving this but I think it should be a separate ticket because I think we should discuss how we want to go about it.

Actions #36

Updated by Tom Clegg over 1 year ago

That all makes sense, thanks. LGTM.

Actions #37

Updated by Brett Smith over 1 year ago

  • Status changed from In Progress to Resolved
Actions #38

Updated by Peter Amstutz over 1 year ago

  • Release set to 64
Actions #39

Updated by Brett Smith over 1 year ago

  • Related to Idea #20521: Python tools emit logs from googleapiclient added
Actions #40

Updated by Brett Smith over 1 year ago

  • Related to Idea #8539: [SDKs/FUSE] Better retry defaults added