Bug #13305
open
keepstore starts returning 401 Unauthorized in the middle of a workflow
Added by Joshua Randall almost 7 years ago.
Updated 11 months ago.
Release relationship:
Auto
Description
I have no idea why this suddenly started happening, but it is causing lots of problems with our production system. Example log excerpt:
2018-03-31T22:14:11.567344145Z 2018-03-31 22:14:11 arvados.arvfile[13083] ERROR: Exception doing block prefetch
2018-03-31T22:14:11.567344145Z Traceback (most recent call last):
2018-03-31T22:14:11.567344145Z File "/usr/lib/python2.7/dist-packages/arvados/arvfile.py", line 600, in _block_prefetch_worker
2018-03-31T22:14:11.567344145Z self._keep.get(b)
2018-03-31T22:14:11.567344145Z File "/usr/lib/python2.7/dist-packages/arvados/retry.py", line 162, in num_retries_setter
2018-03-31T22:14:11.567344145Z return orig_func(self, *args, **kwargs)
2018-03-31T22:14:11.567344145Z File "/usr/lib/python2.7/dist-packages/arvados/keep.py", line 972, in get
2018-03-31T22:14:11.567344145Z return self._get_or_head(loc_s, method="GET", **kwargs)
2018-03-31T22:14:11.567344145Z File "/usr/lib/python2.7/dist-packages/arvados/keep.py", line 1099, in _get_or_head
2018-03-31T22:14:11.567344145Z "failed to read {}".format(loc_s), service_errors, label="service")
2018-03-31T22:14:11.567344145Z KeepReadError: failed to read 5b56e0116fce5730cfe21ef7d8c8ddac+868894+A2034e59c3283856f660928dceda2b4b19abc16a7@5ac00827: service http://127.0.0.1:25107/ responded with 401 HTTP/1.1 401 Unauthorized
Is this an intermittent error? How often does it occur? Was it's appearance correlated with any changes in the cluster? e.g. software versions?
Block locator is 5b56e0116fce5730cfe21ef7d8c8ddac+868894+A2034e59c3283856f660928dceda2b4b19abc16a7@5ac00827
$ TZ=UTC date -Ins -d @`printf %d 0x5ac00827`
2018-03-31T22:13:59,000000000+00:00
compare timestamp:
2018-03-31T22:14:11.567344145Z KeepReadError: ... 401
Looks like blobSignatureTTL seconds elapsed between loading the collection and reading the data.
Two approaches to fix this:
- Increase blobSignatureTTL. It must be identical in API server config and all keepstore configs, and (unfortunately) any running process will start getting 401 immediately when you change blobSignatureTTL, so you basically need to stop the world to change it.
- Make clients refresh signatures by re-fetching the collection record. IIRC arv-mount does this automatically but there isn't (yet) an especially convenient way for native clients to do this.
We could also improve the error reporting here.
- The client could check the timestamp before even making the request, and fail with an "expired token" message if it's expired.
- The client could check the timestamp after receiving a 401 response, in case the timestamp expired while the request was in flight.
- keepstore and keepproxy could return a more specific message ("expired signature") in the status line or response body -- although clients can't necessarily be counted on to propagate this.
- Target version set to Future
Also available in: Atom
PDF