Bug #21260
Updated by Peter Amstutz about 1 year ago
Relevant notes on #21160
https://dev.arvados.org/issues/21160#note-8
https://dev.arvados.org/issues/21160#note-10
Summary: controller enforces the request timeout using a context. The timeout is supposed to be API.RequestTimeout, which defaults to 5 minutes, but I am seeing the controller context expire after 1 minute, which might be a separate bug.
However, Rails and Postgres don't get any signal to stop processing, so the request keeps running even though controller has cut it loose.
When controller cancels the session, the client gets a 500 Internal Server Error, which it treats as a retryable response.
As a result, the client retries the expensive request _which is still running_, and the retry takes up a second request handler slot.
This can cascade: the retry, blocked by the first request if locks are involved, also times out, which triggers another retry that ties up a third request handler slot, and so on.
To make the system more stable, we should have a mechanism that terminates long-running requests in Rails when they exceed a certain runtime and/or the client hangs up.