Bug #21287
Updated by Peter Amstutz 10 months ago
Originally from: https://dev.arvados.org/issues/21285#note-2

In order to service a request, controller can do a number of things:

# Forward it to the local Rails API server
# Handle it entirely within controller (by querying the local database itself)
# Query another service (keep-web, or a crunch-run process on a compute node)
# Query another Arvados instance (federated queries)

In the 3rd or 4th cases, we don't have full control over what the other service is going to do, but we have existing patterns in the keep-web and federated cases where the remote service will query back to our controller in order to verify an API token, retrieve a user record, or get other data.

We've specifically observed this with keep-web, where:

# the Workbench 2 process page sends requests for all the log collection files at once
# this hits controller's request limit
# keep-web sends a request back to verify a token
# the request to verify the token is stuck behind the outstanding requests that were proxied to keep-web, which are waiting on keep-web, which is in turn waiting on the token verification
# the system is deadlocked until something times out

The current fix is to make sure the minimum request limit is high enough that we don't do this to ourselves.

We could get into a similar situation with federation, but an even simpler problem is one where the remote service is in a slow, broken, or malicious state and acts as a tar pit that causes queries to hang for a long time. If the queue is filled with outstanding requests, the system will become unusable. (Of course, this is also possible with slow Rails/database requests, but the sysadmin has more control over those.)

h2. Proposed solution

Limit both incoming and outgoing requests.

* determine request priority and timestamp for priority queue order
* start handling up to MaxConcurrentRequests incoming requests in priority order, with throttling
* when a request handler is going to make an outgoing request to Rails, acquire another throttled lock (up to MaxConcurrentRailsRequests) for that category of outgoing request
** the request acquires the Rails lock in priority order
* also want to bin requests into categories, e.g.
** requests that get information about the token, e.g. current user or current token
** requests that proxy to keep-web
** container gateway requests (already implemented)
** everything else
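
To make the proposal concrete, here is a minimal sketch of a priority-ordered limiter with one pool per request category. It is not the existing controller code: the names @PriorityLimiter@, @Category@, @NewOutgoingPools@ and the category constants are hypothetical, the rule that a larger priority value is served first is an assumption, and an arrival sequence number stands in for the request timestamp. The idea is that the incoming limit (MaxConcurrentRequests) and each outgoing category (for example MaxConcurrentRailsRequests) would each get their own limiter, so a token verification coming back from keep-web is not queued behind the requests that were proxied to keep-web.

```go
package main

import (
	"container/heap"
	"sync"
)

// waiter is one request blocked until a slot frees up.
type waiter struct {
	priority int64         // assumption: larger value is served first
	seq      uint64        // arrival order; stands in for the request timestamp
	ready    chan struct{} // closed when a slot is handed to this waiter
}

// waitQueue implements heap.Interface, ordered by priority, then arrival.
type waitQueue []*waiter

func (q waitQueue) Len() int      { return len(q) }
func (q waitQueue) Swap(i, j int) { q[i], q[j] = q[j], q[i] }
func (q waitQueue) Less(i, j int) bool {
	if q[i].priority != q[j].priority {
		return q[i].priority > q[j].priority
	}
	return q[i].seq < q[j].seq
}
func (q *waitQueue) Push(x any) { *q = append(*q, x.(*waiter)) }
func (q *waitQueue) Pop() any {
	old := *q
	w := old[len(old)-1]
	*q = old[:len(old)-1]
	return w
}

// PriorityLimiter (hypothetical) lets at most max callers hold a slot at once.
type PriorityLimiter struct {
	mu      sync.Mutex
	max     int
	active  int
	nextSeq uint64
	queue   waitQueue
}

func NewPriorityLimiter(max int) *PriorityLimiter {
	return &PriorityLimiter{max: max}
}

// Acquire blocks until a slot is available and returns a func that frees it.
// Calling the returned func more than once has no further effect.
func (l *PriorityLimiter) Acquire(priority int64) (release func()) {
	var once sync.Once
	release = func() { once.Do(l.release) }
	l.mu.Lock()
	if l.active < l.max {
		l.active++
		l.mu.Unlock()
		return release
	}
	w := &waiter{priority: priority, seq: l.nextSeq, ready: make(chan struct{})}
	l.nextSeq++
	heap.Push(&l.queue, w)
	l.mu.Unlock()
	<-w.ready
	return release
}

// release hands the slot to the best waiter, or frees it if nobody waits.
func (l *PriorityLimiter) release() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.queue.Len() > 0 {
		w := heap.Pop(&l.queue).(*waiter)
		close(w.ready) // slot passes directly to w; active is unchanged
		return
	}
	l.active--
}

// Waiting reports how many callers are currently queued.
func (l *PriorityLimiter) Waiting() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.queue.Len()
}

// Category (hypothetical) bins requests so each bin has its own limit.
type Category string

const (
	CategoryTokenInfo Category = "token-info"
	CategoryKeepWeb   Category = "keep-web"
	CategoryOther     Category = "other"
)

// NewOutgoingPools builds one independent limiter per category.
func NewOutgoingPools(limits map[Category]int) map[Category]*PriorityLimiter {
	pools := make(map[Category]*PriorityLimiter, len(limits))
	for c, n := range limits {
		pools[c] = NewPriorityLimiter(n)
	}
	return pools
}
```

A handler would call @Acquire@ on the incoming limiter first, then on the pool for its outgoing category just before making the outgoing call, and release each when done. This sketch has no cancellation or timeout; a real version would presumably need a way to abandon the wait when the client disconnects or a deadline passes, so a tar-pit service cannot hold queued requests indefinitely.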