Bug #18346

closed

Login federation: request storm overwhelming login cluster rails api server

Added by Peter Amstutz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -
Release relationship: Auto

Description

A customer has seen this behavior in 2 different scenarios:

a) when a user used an old token that had been issued by a local cluster before the migration to a login federation. Both the local cluster and the login cluster were on Arvados 2.2.2.
b) when a big workflow was run on a 2.3.0 cluster while the login cluster was on 2.2.2.

Case b) appears to be a 2.3 regression: the workflow that triggered the outage was a re-run of one that did not cause problems on Arvados 2.2.x (or possibly an older version; that is not clear).

The requests that end up at the login cluster api server have a specific request parameter pattern (include_trash=true&select=[uuid]). They seem to be user and collection requests.

The collection requests seem to be for log collections (presumably the ones the workflow steps are writing to).

The requests all get a 401 response from the login cluster api server, but this does not appear to impede the running of the big workflow on the local cluster.
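
To help confirm the pattern from the login cluster's request logs, here is a minimal Python sketch that counts requests matching the signature described above (include_trash=true, a select of only uuid, and a 401 response), grouped by resource type. The function names, the (url, status) input shape, and the assumption that the resource type is the third path segment (as in /arvados/v1/collections/...) are illustrative only; extracting the URL and status from the actual log format is left out.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs


def is_storm_request(url: str, status: int) -> bool:
    """Return True if a request matches the pattern described in this ticket."""
    query = parse_qs(urlsplit(url).query)
    include_trash = query.get("include_trash", [""])[0].lower() == "true"
    # Accept both select=[uuid] and the JSON form select=["uuid"].
    select = query.get("select", [""])[0].replace('"', "").replace(" ", "")
    return status == 401 and include_trash and select == "[uuid]"


def count_by_resource(requests):
    """Count matching requests per resource type (e.g. users, collections).

    `requests` is an iterable of (url, status) pairs; producing those pairs
    from real log lines is an assumption left to the caller.
    """
    counts = Counter()
    for url, status in requests:
        if is_storm_request(url, status):
            parts = [p for p in urlsplit(url).path.split("/") if p]
            # Assumed path layout: /arvados/v1/<resource>/<uuid>
            resource = parts[2] if len(parts) > 2 else "unknown"
            counts[resource] += 1
    return counts
```

Feeding it the parsed log entries from the outage window should show whether the matching 401s are indeed dominated by user and collection requests.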

The customer implemented a workaround: greatly increasing the number of passenger workers on the login cluster api server made it able to handle many more concurrent requests (and return a 401 for them), which avoids the overload death spiral when clients retry.


Subtasks 4 (0 open, 4 closed)

Task #18351: Review 18346-container-token (Resolved, Tom Clegg, 11/10/2021)
Task #18365: build 2.3.1~rc1 with bugfix (Resolved, Peter Amstutz, 11/10/2021)
Task #18366: Review 18346-crunchrun-no-events (Resolved, Peter Amstutz, 11/10/2021)
Task #18373: merge fixes into 2.3.1 (Resolved, 11/10/2021)

Related issues

Related to Arvados - Bug #18887: [federation] wb1 fiddlesticks in login federation (Resolved, Ward Vandewege, 03/25/2022)