Bug #20474: Ensuring metrics are available during busy times - Arvados

Actions

Copy link

Bug #20474

closed

Ensuring metrics are available during busy times

Added by Peter Amstutz almost 2 years ago. Updated over 1 year ago.

Status:

Resolved

Priority:

Normal

Assigned To:

Lucas Di Pentima

Category:

Target version:

Development 2023-06-07

Story points:

0.5

Release:

Arvados 2.7.0

Release relationship:

Auto

Description

Nginx configuration should be API server request limit + a little extra, because the metrics requests are not subject to the API request limit but we need to leave some headroom.

Subtasks 1 (0 open — 1 closed)

Actions

Copy link

Updated by Peter Amstutz almost 2 years ago

Status changed from New to In Progress

Actions

Copy link

Updated by Peter Amstutz almost 2 years ago

Description updated (diff)

Actions

Copy link

Updated by Tom Clegg almost 2 years ago

Assigned To set to Tom Clegg

Actions

Copy link

Updated by Tom Clegg almost 2 years ago

Assigned To changed from Tom Clegg to Lucas Di Pentima

Added a check in 3faf9753f to confirm that service processes continue to handle incoming /metrics and /_inspect/requests when normal requests are being rejected due to MaxConcurrentRequests.

Still need to check that the downstream Nginx config isn't preventing them from getting through to the arvados service processes in the first place.

Actions

Copy link

Updated by Peter Amstutz almost 2 years ago

Target version changed from Development 2023-05-10 sprint to Development 2023-05-24 sprint

Actions

Copy link

Updated by Lucas Di Pentima almost 2 years ago

Updates at e8a6dc77f - branch 20474-installer-api-reqs-limit

These changes affect the multi host salt installer:

Adds 5 additional requests to RailsAPI's request queue when CONTROLLER_MAX_CONCURRENT_REQUESTS is set on local.params.
Also, does some basic type checking.

I've done manual testing on a test cluster to confirm this works.

Actions

Copy link

Updated by Lucas Di Pentima almost 2 years ago

Story points set to 0.5

Actions

Copy link

Updated by Peter Amstutz almost 2 years ago

Target version changed from Development 2023-05-24 sprint to Development 2023-06-07

Actions

Copy link

Updated by Brett Smith almost 2 years ago

Lucas Di Pentima wrote in #note-6:

Updates at e8a6dc77f - branch 20474-installer-api-reqs-limit

I'm not convinced we need this. My current understanding is:

passenger_max_request_queue_size only affects services that have Passenger enabled.
The only server that has passenger_enabled on is RailsAPI. No other service sets that, and the default is off, so they should be unaffected by Passenger or its configuration at all.
I don't see us setting any other nginx configuration that would limit incoming connections.

That said, this is all new to me, so maybe there's something I'm missing? Did you specifically test before the branch whether a change is necessary to allow metrics connections beyond MaxConcurrentRequests? Can you say more about the testing process in general?

Thanks.

Actions

Copy link

#10

Updated by Lucas Di Pentima almost 2 years ago

My understanding is the following:

The MaxConcurrentRequests config knob controls the controller service request queue size, while the passenger_max_request_queue_size config from nginx controls the same thing on the RailsAPI side.
As the majority of the request that comes from clients on the controller service translates to a subsequent request from controller to RailsAPI, this was set on a 1:1 ratio, but if we also want to make additional requests directly to RailsAPI (for other purposes like retrieving its metrics), we need some slack on RailsAPI's request queue, so that when controller's getting hammered by clients and cannot queue any more requests, RailsAPI can accept some additional ones.

Does this make sense or maybe I'm misunderstanding your feedback?

Actions

Copy link

#11

Updated by Brett Smith almost 2 years ago

Lucas Di Pentima wrote in #note-10:

As the majority of the request that comes from clients on the controller service translates to a subsequent request from controller to RailsAPI, this was set on a 1:1 ratio, but if we also want to make additional requests directly to RailsAPI (for other purposes like retrieving its metrics), we need some slack on RailsAPI's request queue, so that when controller's getting hammered by clients and cannot queue any more requests, RailsAPI can accept some additional ones.

Does this make sense or maybe I'm misunderstanding your feedback?

I think we're not on the same page about the functional requirements of this ticket. The devs made a change to ensure that requests to /metrics and /_inspect/requests specifically are not counted against MaxConcurrentRequests. My understanding is this ticket is about making sure that nginx does not do its own enforcement of that limit.

I'm saying, I think that's already the case in current main. RailsAPI does not support either of these endpoints, controller handles them internally, so there is no need to increase the Passenger queue size limit to accommodate these specific requests. And I don't see anything else in the current nginx configuration that would put a limit on controller or other servers that do support the endpoint: Passenger queue size limit shouldn't affect other servers, and there's no other nginx configuration that would limit incoming requests.

If you understood the ticket differently we might need to get clarification from Peter or Tom.

Actions

Copy link

#12

Updated by Brett Smith almost 2 years ago

Brett Smith wrote in #note-11:

RailsAPI does not support either of these endpoints, controller handles them internally, so there is no need to increase the Passenger queue size limit to accommodate these specific requests.

This is what the Admin guide docs say but those docs are out-of-date. RailsAPI support for /metrics was added a year ago. With that, the changes in the branch make sense, and then my comments are:

Adding a comment near the Passenger request queue limit to explain that we add this overhead for metrics seems like a nice affordance, given how much time we've spent on it and it could easily just seem like "generic overhead" otherwise.

I don't feel too strongly about this, I wouldn't hold up a merge over it, but I do think it would be nice to set the overhead as a percentage of max_reqs rather than a static 5, that way it scales nicely to different sizes of cluster. If it was me I'd probably do 10%? But anything 5%-25% seems reasonable.

Thanks.

Actions

Copy link

#13