Bug #21287
Updated by Peter Amstutz 11 months ago
Originally from: https://dev.arvados.org/issues/21285#note-2

In order to service a request, controller can do one of several things:

# Forward it to the local Rails API server
# Handle it entirely within controller (by querying the local database itself)
# Query another service (keep-web, or a crunch-run process on a compute node)
# Query another Arvados instance (federated queries)

In cases 3 and 4, we don't have full control over what the other service is going to do. We do, however, have existing patterns in the keep-web and federated cases where the remote service queries back to our controller in order to verify an API token, retrieve a user record, or get other data.

We've specifically observed this with keep-web, where:

# the Workbench 2 process page sends requests for all the log collection files at once
# this hits controller's request limit
# keep-web sends a request back to controller to verify a token
# the token verification request is stuck behind the outstanding requests that were proxied to keep-web, which are waiting on keep-web, which is in turn waiting on the token verification
# the system is deadlocked until something times out

The current fix is to make sure the minimum request limit is high enough that we don't do this to ourselves.

We could get into a similar situation with federation, but an even simpler problem is one where the remote service is in a slow, broken, or malicious state and acts as a tar pit that causes queries to hang for a long time. If the queue fills up with outstanding requests to that service, the system becomes unusable. (Of course, this is also possible with slow Rails/database requests, but the sysadmin has more control over those.)

I propose a config limit, MaxProxiedRequests or MaxExternalRequests (name is up for discussion), that caps the number of category 3 and 4 requests in flight so that requests in category 1 or 2 can still be processed. Exact implementation TBD.
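To make the proposal concrete, here is a minimal sketch in Go of how a separate cap on external requests could sit alongside the overall request limit. It is not the controller's actual code: @Limiter@, @Category@, @Acquire@ and both error values are hypothetical names invented for illustration, and the sketch rejects requests over the limit immediately instead of queueing them, purely to keep it short.

```go
package main

import (
	"errors"
	"sync"
)

// Category is a hypothetical classification of incoming requests.
type Category int

const (
	// Local covers cases 1 and 2: Rails API or controller's own database queries.
	Local Category = iota
	// External covers cases 3 and 4: keep-web, crunch-run, or a federated cluster.
	External
)

// Hypothetical errors returned when a limit is reached.
var (
	ErrTooManyRequests = errors.New("too many requests in flight")
	ErrTooManyExternal = errors.New("too many external requests in flight")
)

// Limiter enforces an overall cap plus a lower cap on external requests,
// so that external requests alone can never occupy every slot.
type Limiter struct {
	mtx         sync.Mutex
	maxTotal    int // overall request limit
	maxExternal int // assumed to be lower than maxTotal
	total       int // requests currently in flight
	external    int // external requests currently in flight
}

// NewLimiter returns a Limiter with the given caps.
func NewLimiter(maxTotal, maxExternal int) *Limiter {
	return &Limiter{maxTotal: maxTotal, maxExternal: maxExternal}
}

// Acquire reserves a slot for a request of category c. On success it
// returns a release function that must be called when the request
// finishes; calling release more than once has no further effect.
func (l *Limiter) Acquire(c Category) (release func(), err error) {
	l.mtx.Lock()
	defer l.mtx.Unlock()
	if l.total >= l.maxTotal {
		return nil, ErrTooManyRequests
	}
	if c == External && l.external >= l.maxExternal {
		return nil, ErrTooManyExternal
	}
	l.total++
	if c == External {
		l.external++
	}
	var once sync.Once
	return func() {
		once.Do(func() {
			l.mtx.Lock()
			defer l.mtx.Unlock()
			l.total--
			if c == External {
				l.external--
			}
		})
	}, nil
}
```

With @NewLimiter(4, 2)@, a third concurrent @Acquire(External)@ fails with @ErrTooManyExternal@ while @Acquire(Local)@ still succeeds, so a token verification callback from keep-web would still find a free slot in the keep-web scenario described above.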