Bug #18346

Updated by Ward Vandewege 3 months ago

For recording findings related to https://support.curii.com/rt/Ticket/Display.html?id=254

A customer has seen this behavior in 2 different scenarios:

a) when a user used an old token issued by the local cluster prior to the migration to a login federation (local cluster and login cluster both on Arvados 2.2.2)
b) when a big workflow is run on a 2.3.0 cluster whose login cluster is on 2.2.2

The b) case appears to be a 2.3 regression: the workflow that triggered the outage is a re-run that did not cause problems on Arvados 2.2.x (or possibly older; that is not clear).

The requests that end up at the login cluster API server have a specific request parameter pattern (include_trash=true&select=[uuid]). They appear to be user and collection requests.
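For illustration, the parameter pattern can be reproduced as a query string like the one below (a sketch; the exact JSON encoding of the select parameter and the hostname are assumptions, not confirmed from the server logs):

```python
from urllib.parse import urlencode

# Reconstruct the query-string pattern seen at the login cluster API server.
# select is assumed to be the JSON array ["uuid"]; the host is a placeholder.
params = {"include_trash": "true", "select": '["uuid"]'}
query = urlencode(params)

# A user or collection request carrying this pattern would look roughly like:
url = f"https://login.example.com/arvados/v1/collections/<uuid>?{query}"
print(query)  # include_trash=true&select=%5B%22uuid%22%5D
```

Selecting only the uuid field with include_trash=true is consistent with a client checking for the existence of an object (trashed or not) rather than fetching its contents.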

The collection requests seem to be for log collections (presumably the workflow steps writing to them).

All of these requests get a 401 response from the login cluster API server, but this does not appear to impede the big workflow running on the local cluster.

The customer implemented a workaround: greatly increasing the number of Passenger workers on the login cluster API server allowed it to handle many more concurrent requests (and return 401s for them), avoiding the overload death spiral that occurs when clients retry against an already-overloaded server.
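The retry storm driving that death spiral can be sketched as follows. This is not part of the customer's fix or of Arvados client code; it is a generic illustration of why spreading retries out (exponential backoff with full jitter) keeps a burst of failing clients from hammering an overloaded server in lockstep:

```python
import random

def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Exponential backoff with full jitter.

    Each failing client sleeps a random amount in [0, min(cap, base*2^n)]
    before retry n, so retries de-synchronize instead of arriving as a
    single wave every few seconds (the pattern that overwhelmed the
    login cluster before more workers were added).
    """
    return [
        random.uniform(0, min(cap, base * 2 ** attempt))
        for attempt in range(max_retries)
    ]

delays = backoff_delays()
```

Without jitter, every client that received an error at the same moment retries at the same moment, so each retry wave is at least as large as the original overload; adding workers raised the server's capacity to absorb such waves, which is why the workaround was effective even though the requests still return 401.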