Idea #13697


Prevent the API server and database from continuing to serve requests to clients after timeout

Added by Joshua Randall almost 6 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Target version:
Story points: 0.5
Release relationship: Auto

Description

A typical nginx configuration for the API server times out requests after 5 minutes (300s). However, passenger/rails currently does not notice that the client has given up on the request and continues to process it. On some occasions (when things go wrong in other ways) we have had requests running for hours, holding locks that prevent other parts of the system from working and, in the worst cases, effecting a DoS on the entire API server by consuming all available passenger workers.

There is no point in processing any part of a request for longer than the request timeout. An easy win to prevent the above scenario would be to set a PostgreSQL statement_timeout equal to the nginx gateway timeout.

Our system has been running with a 300s statement_timeout for a few weeks without issue: https://github.com/wtsi-hgi/arvados/commit/d9728e17148db53caf1f16fce032448c3d5c1432

The timeout value should probably come from configuration rather than being hard-coded, so that admins can set it appropriately when a non-standard nginx configuration is used.
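
For illustration, here is a minimal Go sketch of the statement-timeout idea described above, using database/sql (the Rails server would do the equivalent when it checks out an ActiveRecord connection). The function name, DSN, and 300s value are invented for this example, not taken from Arvados:

    package main

    import (
        "context"
        "database/sql"
        "fmt"
        "time"

        _ "github.com/lib/pq" // PostgreSQL driver, assumed for this sketch
    )

    // connWithStatementTimeout checks a connection out of the pool and applies a
    // PostgreSQL statement_timeout equal to the gateway timeout, so the database
    // gives up on a statement at the same time nginx gives up on the client request.
    func connWithStatementTimeout(ctx context.Context, db *sql.DB, timeout time.Duration) (*sql.Conn, error) {
        conn, err := db.Conn(ctx)
        if err != nil {
            return nil, err
        }
        // statement_timeout takes milliseconds when given as a bare integer.
        if _, err := conn.ExecContext(ctx, fmt.Sprintf("SET statement_timeout = %d", timeout.Milliseconds())); err != nil {
            conn.Close()
            return nil, err
        }
        return conn, nil
    }

    func main() {
        db, err := sql.Open("postgres", "dbname=arvados sslmode=disable") // illustrative DSN
        if err != nil {
            panic(err)
        }
        // Match the 300s nginx gateway timeout; in practice this would come from config.
        conn, err := connWithStatementTimeout(context.Background(), db, 300*time.Second)
        if err != nil {
            panic(err)
        }
        defer conn.Close()
    }
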


Subtasks 1 (0 open, 1 closed)

Task #18156: Review 13697-database-timeout (Resolved, Tom Clegg, 09/23/2021)

Related issues

Related to Arvados - Bug #18184: Flaky test due to bug in old version of singularity (Resolved, Tom Clegg, 09/24/2021)
#1

Updated by Tom Morris about 5 years ago

  • Target version set to To Be Groomed
#2

Updated by Tom Morris about 5 years ago

  • Story points set to 0.5
#3

Updated by Tom Morris about 5 years ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
#4

Updated by Peter Amstutz almost 3 years ago

  • Target version deleted (Arvados Future Sprints)
#5

Updated by Joshua Randall over 2 years ago

I am once again running into problems with API server resource starvation: long-running requests cause normal clients with reasonable timeouts and retry behaviour to essentially mount a denial-of-service attack on the server (in this case, with large collection creation requests).

IMHO it is not possible to run a stable Arvados API server under heavy load without applying a patch like the one that was implemented at Sanger to introduce statement timeouts in the database to mitigate this issue (see: https://github.com/wtsi-hgi/hgi-systems/blob/master/ansible/roles/arvados-master/files/hgi-integration-1.1.4.20180723133344-8.patch). Apologies that the OP for this issue did not include the patch!

Definitions of "heavy load" may vary depending on the capabilities of the API server and database hosts (and passenger settings such as the number of workers), but based on prior experience, the sorts of things that may take a long time (longer than the timeout) and lead to client-retry DoS attacks are:
- attempts to create very large collections (I think the thing that takes a long time is validating manifest block signatures)
- attempts to read very large collections with signed manifests (unsigned manifests are fast)
- attempts to work with container requests with large inline input documents
- certain queries against the log table, when it has grown very large

#6

Updated by Peter Amstutz over 2 years ago

  • Target version set to 2021-09-29 sprint
#7

Updated by Peter Amstutz over 2 years ago

  • Assigned To set to Tom Clegg
#8

Updated by Tom Clegg over 2 years ago

  • Status changed from New to In Progress

13697-database-timeout @ fdd3dd304439a7912db1c6ef257cf512ec7b3b52 -- developer-run-tests: #2707

  • Set the database statement and lock timeouts to RequestTimeout (similar to the contributed patch, but implemented in the "db connection checkout" hook instead of the ArvadosModel initializer which runs each time an object is created/loaded)
  • In Go services that use lib/service (including controller and keepstore, but not yet keepproxy or keep-web), the request context is cancelled if the handler doesn't finish or hijack the connection before reaching RequestTimeout.
  • When keep-web gets migrated to use lib/service, this won't cut off long/slow downloads (the webdav handler doesn't interrupt copy-bytes-to-ResponseWriter when context cancels) but we might (?) need some more special handling so large/slow uploads are allowed to take longer than RequestTimeout.
  • Remove httpserver.HandlerWithContext in favor of the BaseContext feature already provided by http.Server.
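
To make this concrete, here is a rough Go sketch of the last two points (cancelling the request context at RequestTimeout, and using http.Server's BaseContext). This is not the actual lib/service code; the handler, address, and 5-minute value are invented for the example:

    package main

    import (
        "context"
        "net"
        "net/http"
        "time"
    )

    // withRequestTimeout cancels the request context once timeout has elapsed.
    // The handler goroutine is not killed; database queries and other calls that
    // honour r.Context() return early when the context is cancelled.
    func withRequestTimeout(h http.Handler, timeout time.Duration) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ctx, cancel := context.WithTimeout(r.Context(), timeout)
            defer cancel()
            h.ServeHTTP(w, r.WithContext(ctx))
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            // Long-running work here should check r.Context().Done().
            w.Write([]byte("ok\n"))
        })

        // rootCtx would carry the service's logger/config in a real server.
        rootCtx := context.Background()

        srv := &http.Server{
            Addr:    ":8080",
            Handler: withRequestTimeout(mux, 5*time.Minute),
            // BaseContext makes every request context a child of rootCtx,
            // replacing a custom wrapper like httpserver.HandlerWithContext.
            BaseContext: func(net.Listener) context.Context { return rootCtx },
        }
        if err := srv.ListenAndServe(); err != nil {
            panic(err)
        }
    }
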
#9

Updated by Ward Vandewege over 2 years ago

Tom Clegg wrote:

13697-database-timeout @ fdd3dd304439a7912db1c6ef257cf512ec7b3b52 -- developer-run-tests: #2707

  • Set the database statement and lock timeouts to RequestTimeout (similar to the contributed patch, but implemented in the "db connection checkout" hook instead of the ArvadosModel initializer which runs each time an object is created/loaded)
  • In Go services that use lib/service (including controller and keepstore, but not yet keepproxy or keep-web), the request context is cancelled if the handler doesn't finish or hijack the connection before reaching RequestTimeout.
  • When keep-web gets migrated to use lib/service, this won't cut off long/slow downloads (the webdav handler doesn't interrupt copy-bytes-to-ResponseWriter when context cancels) but we might (?) need some more special handling so large/slow uploads are allowed to take longer than RequestTimeout.

For this, would it be worth putting a note along these lines (with a reference to this ticket) in the webdav source for when we refactor it to use lib/service? That way it won't be overlooked.

  • Remove httpserver.HandlerWithContext in favor of the BaseContext feature already provided by http.Server.

Nice!

I see a singularity-related test failure at developer-run-tests-remainder: #2814 /consoleFull; is that unrelated?

LGTM otherwise!

#10

Updated by Tom Clegg over 2 years ago

Ward Vandewege wrote:

I see a singularity-related test failure at developer-run-tests-remainder: #2814 /consoleFull; is that unrelated?

Hm, the same failure happened to an #8363 branch as well, and went away on the next attempt.

executor_test.go:166:
    c.Check(code, Equals, expectCode)
... obtained int = 151
... expected int = 0
Exit code 151 is 128 + 23, i.e. SIGURG. This seems to be a singularity bug, reported against 3.5.2 and fixed in April 2020. The Go runtime (starting at or before 1.14) sends itself SIGURG to preempt goroutines, and these signals are forwarded to child processes by default.

Perhaps the fix is to update arvados-server install from singularity 3.5.2 to 3.7.4 (source:tools/compute-images/scripts/base.sh already uses 3.7.4).

Added #18184

#11

Updated by Ward Vandewege over 2 years ago

Tom Clegg wrote:

Ward Vandewege wrote:

I see a singularity-related test failure at developer-run-tests-remainder: #2814 /consoleFull; is that unrelated?

Hm, the same failure happened to an #8363 branch as well, and went away on the next attempt.

[...]

Exit code 151 is 128 + 23, i.e. SIGURG. This seems to be a singularity bug, reported against 3.5.2 and fixed in April 2020. The Go runtime (starting at or before 1.14) sends itself SIGURG to preempt goroutines, and these signals are forwarded to child processes by default.

Perhaps the fix is to update arvados-server install from singularity 3.5.2 to 3.7.4 (source:tools/compute-images/scripts/base.sh already uses 3.7.4).

Good sleuthing! Yes, please upgrade to 3.7.4; that is the minimum version we intend to support.

#12

Updated by Tom Clegg over 2 years ago

  • Related to Bug #18184: Flaky test due to bug in old version of singularity added
#13

Updated by Tom Clegg over 2 years ago

  • Status changed from In Progress to Resolved
#14

Updated by Ward Vandewege over 2 years ago

  • Release set to 42
