Bug #13305

open

keepstore starts returning 401 Unauthorized in the middle of a workflow

Added by Joshua Randall about 6 years ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version: Future
Story points: -
Release: 60
Release relationship: Auto

Description

I have no idea why this suddenly started happening, but it is causing lots of problems with our production system. Example log excerpt:

2018-03-31T22:14:11.567344145Z 2018-03-31 22:14:11 arvados.arvfile[13083] ERROR: Exception doing block prefetch
2018-03-31T22:14:11.567344145Z Traceback (most recent call last):
2018-03-31T22:14:11.567344145Z File "/usr/lib/python2.7/dist-packages/arvados/arvfile.py", line 600, in _block_prefetch_worker
2018-03-31T22:14:11.567344145Z self._keep.get(b)
2018-03-31T22:14:11.567344145Z File "/usr/lib/python2.7/dist-packages/arvados/retry.py", line 162, in num_retries_setter
2018-03-31T22:14:11.567344145Z return orig_func(self, *args, **kwargs)
2018-03-31T22:14:11.567344145Z File "/usr/lib/python2.7/dist-packages/arvados/keep.py", line 972, in get
2018-03-31T22:14:11.567344145Z return self._get_or_head(loc_s, method="GET", **kwargs)
2018-03-31T22:14:11.567344145Z File "/usr/lib/python2.7/dist-packages/arvados/keep.py", line 1099, in _get_or_head
2018-03-31T22:14:11.567344145Z "failed to read {}".format(loc_s), service_errors, label="service")
2018-03-31T22:14:11.567344145Z KeepReadError: failed to read 5b56e0116fce5730cfe21ef7d8c8ddac+868894+A2034e59c3283856f660928dceda2b4b19abc16a7@5ac00827: service http://127.0.0.1:25107/ responded with 401 HTTP/1.1 401 Unauthorized

Actions #1

Updated by Tom Morris about 6 years ago

Is this an intermittent error? How often does it occur? Was its appearance correlated with any changes in the cluster, e.g. software versions?

Actions #2

Updated by Tom Clegg about 6 years ago

Block locator is 5b56e0116fce5730cfe21ef7d8c8ddac+868894+A2034e59c3283856f660928dceda2b4b19abc16a7@5ac00827

$ TZ=UTC date -Ins -d @`printf %d 0x5ac00827`
2018-03-31T22:13:59,000000000+00:00

compare timestamp:

2018-03-31T22:14:11.567344145Z KeepReadError: ... 401

Looks like blobSignatureTTL seconds elapsed between loading the collection and reading the data.
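
For reference, a small Python sketch (not part of the SDK, just illustrating the arithmetic above) that pulls the expiry timestamp out of a signed locator's +A permission hint:

# Illustrative only: decode the hex expiry timestamp embedded in a signed
# Keep locator's +A permission hint.
import datetime

def signature_expiry(locator):
    """Return the UTC expiry time encoded in the locator's +A hint, or None."""
    for hint in locator.split('+'):
        if hint.startswith('A') and '@' in hint:
            # Everything after '@' is a Unix timestamp in hex.
            return datetime.datetime.utcfromtimestamp(
                int(hint.split('@', 1)[1], 16))
    return None

loc = ("5b56e0116fce5730cfe21ef7d8c8ddac+868894"
       "+A2034e59c3283856f660928dceda2b4b19abc16a7@5ac00827")
print(signature_expiry(loc))  # 2018-03-31 22:13:59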

Two approaches to fix this:
  1. Increase blobSignatureTTL. It must be identical in the API server config and all keepstore configs, and (unfortunately) any running process will start getting 401s immediately when you change blobSignatureTTL, so you basically need to stop the world to change it.
  2. Make clients refresh signatures by re-fetching the collection record. IIRC arv-mount does this automatically, but there isn't (yet) an especially convenient way for native clients to do this (a rough sketch follows below).
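
A rough sketch of approach 2 using the Python SDK (the collection UUID is a placeholder, and the exact CollectionReader arguments may differ by SDK version):

import arvados

api = arvados.api('v1')
# Re-fetch the collection record; its manifest_text comes back with
# freshly signed (+A...@...) block locators.
record = api.collections().get(uuid='zzzzz-4zz18-xxxxxxxxxxxxxxx').execute()
# Build a new reader from the fresh manifest instead of reusing the stale one.
reader = arvados.CollectionReader(record['manifest_text'], api_client=api)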
Actions #3

Updated by Tom Clegg about 6 years ago

We could also improve the error reporting here.
  • The client could check the timestamp before even making the request, and fail with an "expired token" message if it's expired (see the sketch after this list).
  • The client could check the timestamp after receiving a 401 response, in case the timestamp expired while the request was in flight.
  • keepstore and keepproxy could return a more specific message ("expired signature") in the status line or response body -- although clients can't necessarily be counted on to propagate this.
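
A sketch of the first suggestion (purely illustrative, not existing SDK behaviour): check the +A hint's expiry before issuing the request.

import time

class ExpiredSignatureError(Exception):
    pass

def check_signature_fresh(locator):
    # Raise if the locator's +A permission hint has already expired.
    for hint in locator.split('+'):
        if hint.startswith('A') and '@' in hint:
            expiry = int(hint.split('@', 1)[1], 16)
            if expiry < time.time():
                raise ExpiredSignatureError(
                    "signature on %s expired at %d" % (locator, expiry))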
Actions #4

Updated by Peter Amstutz about 1 year ago

  • Release set to 60
Actions #5

Updated by Peter Amstutz about 2 months ago

  • Target version set to Future