Project

General

Profile

Actions

Idea #20602

closed

Prioritize requests made by workbench 2

Added by Peter Amstutz about 1 year ago. Updated 11 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Start date:
06/08/2023
Due date:
Story points:
-
Release relationship:
Auto

Description

Use Origin header to distinguish requests coming from workbench2 (list of "interactive" origins should be configurable to accommodate future applications that are interactive and talk to API server).

Traffic from interactive applications (workbench2) is prioritized, we want to avoid returning API errors and return responses as quickly as possible, at the expense of "general admission" requests from non-interactive API clients and services.

Keep-web should also use/pass through Origin header to API to indicate when a request is coming from workbench2 and should be handled at interactive priority (i.e. when getting a collection listing).

To reduce queuing time the "general admission" queue should be small provide backpressure with 503 responses, because they are assumed to be fine to wait and retry on 503.

Then the UI queue (higher priority) can be a bit larger to avoid API errors, but by having the general admission queue be smaller, the UI requests avoid being stuck behind a large queue of general admission requests.

Something like

40 workers

general admission queue is 2x = 80 -> gets 503 errors much sooner

prioritized is 4x = 160 (or higher)

alternately --

As each connection comes in, read headers and assign a priority, put in a priority queue

Rework how controller forwards connections to passenger/rails

Limit concurrent forwarded requests to Rails == worker pool

As a worker becomes free, pull the next request from the priority queue -- UI requests will sort before general admission requests.

Also - need to look at passenger behavior & settings related to priority and detecting when workers are busy.


Subtasks 1 (0 open1 closed)

Task #20623: Review 20602-controller-qosResolvedPeter Amstutz06/08/2023Actions

Related issues

Related to Arvados - Bug #20533: Better handling of request surges when canceling a large workflowResolvedPeter AmstutzActions
Related to Arvados Epics - Idea #20599: Scaling to 1000s of concurrent containersResolved06/01/202303/31/2024Actions
Actions #1

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
  • Subject changed from Distinguish between requests that will be retries and those that won't to Distinguish requests made by workbench 2
Actions #2

Updated by Peter Amstutz about 1 year ago

  • Related to Bug #20533: Better handling of request surges when canceling a large workflow added
Actions #3

Updated by Peter Amstutz about 1 year ago

Passenger notes

We use passenger_concurrency_model: process because thread is only available in the proprietary enterprise product.

Each application process only has a single thread and can only handle 1 request at a time.

https://www.phusionpassenger.com/docs/references/config_reference/nginx/#passenger_concurrency_model

There is passenger_force_max_concurrent_requests_per_process

Passenger's highly optimized load balancer assumes that Ruby apps can handle 1 (or thread-limited amount of) concurrent connection(s)

The special case is having multiple long-lived connections (examples being websockets or server sent events), but we don't do that from our Rails server.

So this confirms 1 worker = 1 request up to passenger_max_pool_size

Then there is passenger_max_request_queue_size

When all application processes are already handling their maximum number of concurrent requests, Passenger will queue all incoming requests. This option specifies the maximum size for that queue. If the queue is already at this specified limit, then Passenger will immediately send a "503 Service Unavailable" error to any incoming requests. You may use passenger_request_queue_overflow_status_code to customize the response status.

A value of 0 means that the queue is unbounded.

So it makes sense to set passenger's own queue size to a small value and handle prioritization as much as possible in controller.

Another thought, a prioritization strategy in controller could let us do weighted priorities where UI requests are selected 2/3 of the time and general admission requests are 1/3 of the time.

I think that even if we are prioritizing on a per-request basis, which allows us to bump UI requests to the front of the line, we still need separate limits so that a queue full of general admission requests don't mean we stop accepting UI requests.

Actions #4

Updated by Peter Amstutz about 1 year ago

From discussion: weighted handling doesn't really make sense, if we can't even handle load from UI requests then everyone is going to have a bad time.

Proposed implementation is to existing rate limiting and introduce queuing based on prioritization.

Maximum active requests is proportional to passenger workers (eg scaled by count of workers like 1.1 or 1.2)

Actions #5

Updated by Peter Amstutz about 1 year ago

  • Related to Idea #20599: Scaling to 1000s of concurrent containers added
Actions #6

Updated by Peter Amstutz about 1 year ago

  • Subject changed from Distinguish requests made by workbench 2 to Prioritize requests made by workbench 2
Actions #7

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Future to Development 2023-06-21 sprint
Actions #8

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-06-21 sprint to Development 2023-06-07
Actions #9

Updated by Peter Amstutz about 1 year ago

  • Assigned To set to Tom Clegg
Actions #10

Updated by Tom Clegg about 1 year ago

  • Status changed from New to In Progress
Actions #11

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-06-07 to Development 2023-06-21 sprint
Actions #12

Updated by Tom Clegg about 1 year ago

New config:

      # Maximum number of incoming requests to hold in a priority
      # queue waiting for one of the MaxConcurrentRequests slots to be
      # free. When the queue is longer than this, respond 503 to the
      # lowest priority request.
      #
      # If MaxQueuedRequests is 0, respond 503 immediately to
      # additional requests while at the MaxConcurrentRequests limit.
      MaxQueuedRequests: 64

I think we will want to merge the new feature and try out a few values of MaxConcurrentRequests and MaxQueuedRequests and then consider changing the defaults accordingly. In the meantime, 0 would be the same behavior we had before, but 64 seems pretty safe.

The data structure choice (heap) is debatable. Push and pop are O(logN), but dropping the lowest-priority request from the queue is O(N) (N = queue size). I think it's OK to optimize for success here. If the queue is large, clients will be timing out and disconnecting on their own before falling off the end. If the queue is short, O(N) isn't too bad.

"Lock container" requests are ineligible for the queue. The idea is that once controller/rails hits capacity, dispatchers should immediately stop trying to start new containers.

New metrics: arvados_queued_requests, arvados_max_queued_requests

20602-controller-qos @ 110ce6a7e9c55fca0e2d43f8629be73fc0f8ba25 -- developer-run-tests: #3694

Actions #13

Updated by Peter Amstutz about 1 year ago

I haven't looked at the branch yet, but will there be a way, using metrics, to distinguish the volume of interactive requests from general requests? If I just see a single full request queue I won't be able to tell if it's fine (interactive requests are getting handled promptly, non-interactive requests are backing off) or on the edge of failure (the queue is full of interactive requests and the system is on the brink of failure).

Actions #14

Updated by Tom Clegg about 1 year ago

Peter Amstutz wrote in #note-13:

will there be a way, using metrics, to distinguish the volume of interactive requests from general requests?

Currently there is not. Should this be a blocker?

Perhaps the metric we want here is a summary (histogram etc) of time spent in the queue by interactive requests. We would expect 0 when MaxConcurrentRequests isn't reached yet, then a short duration when we're successfully prioritizing interactive traffic, then a long duration when MaxConcurrentRequests isn't even enough to handle the interactive traffic.

Perhaps two "time spent in queue" summary metrics -- one for interactive, one for non-interactive -- would be a good start?

Actions #15

Updated by Peter Amstutz about 1 year ago

Tom Clegg wrote in #note-14:

Peter Amstutz wrote in #note-13:

will there be a way, using metrics, to distinguish the volume of interactive requests from general requests?

Currently there is not. Should this be a blocker?

Yes.

Perhaps the metric we want here is a summary (histogram etc) of time spent in the queue by interactive requests. We would expect 0 when MaxConcurrentRequests isn't reached yet, then a short duration when we're successfully prioritizing interactive traffic, then a long duration when MaxConcurrentRequests isn't even enough to handle the interactive traffic.

Perhaps two "time spent in queue" summary metrics -- one for interactive, one for non-interactive -- would be a good start?

The thing I want to see is how close to falling over we are. Because if we're 100% busy but 75% of capacity is spent handling interactive requests, it is likely we urgently need to deal with it, but if 25% of capacity is interactive and 75% is general, that's probably fine (backpressure will slow down the rest of the system and hopefully reach equilibrium).

That's the metric I want to see, I don't know if "time in queue" will give me that information.

As I said I haven't had a chance to dig into the implementation yet so there might also be some hidden assumptions you or I have about what it should do / does do.

Actions #16

Updated by Tom Clegg about 1 year ago

If you define "falling over" as "workbench gets 503" then yes, "% of queue used by interactive requests" would tell you that. When it gets too high, you can always "fix" it by making the queue bigger.

But if you define "falling over" as "workbench gets 503 or timeouts, or is just super slow" -- which can't be fixed just by making the queue longer -- the number of queued requests doesn't tell you much. I think response time is more useful for this. Currently, we have timing stats for all requests, but without a metric that puts them in separate buckets, it will be hard to see the difference between "fast interactive + slow batch" (prioritizing is working) and "slow interactive + slow batch" (prioritization is insufficient, need more resources).

I thought "queue time for interactive requests" might be good because (unlike "ttfb for interactive requests") if it's not already 0, you can always reduce it by adding backend slots.

Actions #17

Updated by Peter Amstutz about 1 year ago

Tom Clegg wrote in #note-16:

If you define "falling over" as "workbench gets 503" then yes, "% of queue used by interactive requests" would tell you that. When it gets too high, you can always "fix" it by making the queue bigger.

Yes, that is exactly how I define it. Users seeing 503 errors is really, really, really, really bad and the reason this ticket is high priority (pun not intended), so we need to a way to know if we're getting dangerously close to that happening.

From standup discussion: either a count of requests, or a % of requests that are interactive vs all requests is fine, we need know when we're maxing out on requests if a significant % of those are interactive. The base assumption is that the cluster must have enough capacity to handle the interactive load, backpressure to slow down clients isn't a solution for humans.

But if you define "falling over" as "workbench gets 503 or timeouts, or is just super slow" -- which can't be fixed just by making the queue longer -- the number of queued requests doesn't tell you much. I think response time is more useful for this. Currently, we have timing stats for all requests, but without a metric that puts them in separate buckets, it will be hard to see the difference between "fast interactive + slow batch" (prioritizing is working) and "slow interactive + slow batch" (prioritization is insufficient, need more resources).

I thought "queue time for interactive requests" might be good because (unlike "ttfb for interactive requests") if it's not already 0, you can always reduce it by adding backend slots.

I think we want both. You're right, if average queue time or response time is high that's also an indication we need to throw more resources at the cluster, but it's less of an emergency as users getting back 503 errors.

Actions #18

Updated by Peter Amstutz about 1 year ago

20602-controller-qos @ 110ce6a7e9c55fca0e2d43f8629be73fc0f8ba25

The Go standard library has a heap package https://pkg.go.dev/container/heap

But the standard library package doesn't seem to do much -- the example is almost as long as the from-scratch implementation you made (and suffers greatly from the fact that Go doesn't have generics). That said, I'm wondering if there was a particular reason to not use it, or you just overlooked it -- might be worth a comment for the future. (but if the code works that's here already, there's no reason to rewrite it).

I'm going to assume the heap is correct without checking your homework (I'd have to blow off the dust from my algorithms book) but it might benefit from a few unit tests.

You already made a note about adjusting the defaults for MaxConcurrentRequests and MaxQueuedRequests. I suspect we're going to want a smaller MaxConcurrentRequests (which is really proportional to the number of passenger workers, so something like workers * 1.5) and a greater MaxQueuedRequests (maybe 100 or 200), but we can do some experiments. Another reason to have metrics to get some feedback about what is happening inside controller.

Publishing metrics that are just the count of how many requests of each priority are in the queue at that moment shouldn't be so hard, right? That's basically what queued_requests already does, just with an added pass through the
list to count things up.

This metric would let us see that if interactive request count appeared as > 0 consistently in the queued requests graph, that would suggest that that general requests may be getting starved, and the cluster needs more resources.

I see how publishing a metric of what is currently being handled would be a little more inconvenient, although that would let us see if there is a risk of interactive requests starving the system before it actually starts to happen.

Style nit: the priorities being MinInt64, 0, 1, 2 reads a little strangely, why not use -1 instead of MinInt64? Or go 0, 1, 2, 3? Or declare them as constants, or an enum.

I really want to see this in action so I would be alright with a preliminary merge (and keep the ticket open) that allows us to do some testing on the scale cluster.

Actions #19

Updated by Tom Clegg about 1 year ago

Peter Amstutz wrote in #note-18:

I'm wondering if there was a particular reason to not use it, or you just overlooked it

Overlooked. Indeed, it only accounts for half the code, but I figure it's still better form to use it. Updated.

I'm going to assume the heap is correct without checking your homework (I'd have to blow off the dust from my algorithms book) but it might benefit from a few unit tests.

This seems like a nice benefit of using stdlib.

Publishing metrics that are just the count of how many requests of each priority are in the queue at that moment shouldn't be so hard, right? That's basically what queued_requests already does, just with an added pass through the list to count things up.

Yes, the only little complication is that RequestLimiter doesn't know the possible priority values. To avoid overthinking it, we could adjust the values a bit and use 4 buckets: {don't-queue}, neg, 0, pos.

This metric would let us see that if interactive request count appeared as > 0 consistently in the queued requests graph, that would suggest that that general requests may be getting starved, and the cluster needs more resources.

I don't think that's quite what it would mean. If I saw interactive request count = 100 and average time spent in queue = 10 ms, I don't think I would recommend more resources to fix it. But if count = 40 and time spent in queue = 4 s, I would.

I see how publishing a metric of what is currently being handled would be a little more inconvenient, although that would let us see if there is a risk of interactive requests starving the system before it actually starts to happen.

I'm thinking time spent in queue would isolate the queue effect from the time it takes to actually handle the requests, which can vary a lot based on the type of request.

Style nit: the priorities being MinInt64, 0, 1, 2 reads a little strangely, why not use -1 instead of MinInt64? Or go 0, 1, 2, 3? Or declare them as constants, or an enum.

Ah yes. Made a const for that.

I really want to see this in action so I would be alright with a preliminary merge (and keep the ticket open) that allows us to do some testing on the scale cluster.

Sounds good

Actions #21

Updated by Peter Amstutz about 1 year ago

Tom Clegg wrote in #note-20:

20602-controller-qos @ b30e070d2cb99ccbe60eb9684bf0627dcbc3779b -- developer-run-tests: #3698

This LGTM

Actions #22

Updated by Peter Amstutz about 1 year ago

Preliminary tests show this approach working very well. Even fully loaded, Workbench 2 is more responsive than ever, while the rest of the system smoothly adjusts to discover the maximum capacity without falling off a cliff.

Actions #23

Updated by Peter Amstutz about 1 year ago

Now trying to figure out good installer settings to inform defaults.

Initial try.

With a 2 core/8 GiB node (m5.large) it can sustain about 120 containers, and is about 90% busy. That's with 4 nginx workers, and MaxConcurrentRequests of 8.

I tried a queue size of both 128 and 256. The different queue sizes doesn't have an obvious impact, but we don't have metrics to measure queue volume or wait yet.

MaxConcurrentRequests must not exceed passenger_max_request_queue_size or passenger will produce 503 errors

The ruby processes each consume about 40% CPU (this makes sense since there are 4 of them on 2 cores).

Workbench continues to work well.

Actions #24

Updated by Peter Amstutz about 1 year ago

I tried this with 2 workers and queue size of 64. This put the host at about 80% busy but my general sense is that it leads to slightly worse utilization.

It is a little hard to judge because I am looking at "max concurrent containers" as one of the metrics and that has been pretty much stuck at 126 concurrent containers, which means it is continuing to see 503 errors.

update: after making another configuration change going from 2 workers / queue size 64 back to 4 workers / queue size 256 (from the previous note), the max concurrent containers just adjusted to 141.

I'm trying to decide if this behavior of rejecting lock requests if they can't be processed immediately is really what we want. I wonder if pushing them to the back of the queue would have a similar effect under load but without needing for dispatcher to "get lucky" when it sends the lock request. What I'm seeing right now is that it hits a ceiling where it won't start any more containers because the API server is perpetually slightly backlogged. I think maybe that's actually exactly the situation we do want to stop adding work to the pile.

I saw the max supervisors bug in the wild, I had 142 workflow runners going and none of them were making any progress.

Actions #25

Updated by Tom Clegg about 1 year ago

Peter Amstutz wrote in #note-24:

API server is perpetually slightly backlogged. I think maybe that's actually exactly the situation we do want to stop adding work to the pile.

Right. Fast queue is good, slow queue is bad, no queue at all means server is not well utilized.

Perhaps instead of "ineligible for queue" we want either
  • "ineligible for queue when expected queue wait time > {threshold}", where expected queue wait time is the time spent in the queue by the last request that successfully moved from queue to processing, or
  • "503 when this request reaches {threshold} seconds in queue"

The threshold could be provided by a-d-c in a request header.

Actions #26

Updated by Peter Amstutz about 1 year ago

Tom Clegg wrote in #note-25:

Peter Amstutz wrote in #note-24:

API server is perpetually slightly backlogged. I think maybe that's actually exactly the situation we do want to stop adding work to the pile.

Right. Fast queue is good, slow queue is bad, no queue at all means server is not well utilized.

Perhaps instead of "ineligible for queue" we want either
  • "ineligible for queue when expected queue wait time > {threshold}", where expected queue wait time is the time spent in the queue by the last request that successfully moved from queue to processing, or
  • "503 when this request reaches {threshold} seconds in queue"

The threshold could be provided by a-d-c in a request header.

I believe that even the tiny node I'm testing with could probably support even more containers if the decision to accept or reject a lock request was based on throughput over a short period of time instead of an instantaneous accept/reject. This is based on an earlier test (prior to the prioritization branch being implemented) where it got up to 200 containers (of course, workbench 2 was unusable). With the prioritization branch being more conservative, it levels off around 120 - 140.

Some combination of putting lock requests at the back of the queue and giving them a short expiration time (like 1-2 seconds) might reflect the loaded capacity of the system a little better.

Actions #27

Updated by Tom Clegg about 1 year ago

Peter Amstutz wrote in #note-23:

I tried a queue size of both 128 and 256. The different queue sizes doesn't have an obvious impact, but we don't have metrics to measure queue volume or wait yet.

FWIW we do already have the queued_requests metric, which you know is effectively all low+normal priority if you know wb2 wasn't behaving badly.

Actions #28

Updated by Peter Amstutz about 1 year ago

Tom Clegg wrote in #note-27:

Peter Amstutz wrote in #note-23:

I tried a queue size of both 128 and 256. The different queue sizes doesn't have an obvious impact, but we don't have metrics to measure queue volume or wait yet.

FWIW we do already have the queued_requests metric, which you know is effectively all low+normal priority if you know wb2 wasn't behaving badly.

Ah, I overlooked that. I have been looking at the controller and dispatchcloud code, but the request limiter lives in the SDK. That will be helpful.

Actions #29

Updated by Tom Clegg about 1 year ago

20602-queue-metrics @ 1fd838ddcc3274448908227c285d1b7f233207d9 -- developer-run-tests: #3712

Adds
  • 2 second max duration a "lock container" request can spend in the queue (at that point it returns 503) -- configurable as API.MaxQueueTimeForLockRequests
  • adds queue_delay_seconds (time spent in queue before being accepted to a processing slot), summary for each of low/normal/high priority
  • adds queue_timeout_seconds (time spent in queue before being abandoned by client or timed out by rule), summary for each of low/normal/high priority
  • splits queued_requests into 3 gauges for low/normal/high priority.
Actions #30

Updated by Peter Amstutz about 1 year ago

Tom Clegg wrote in #note-29:

20602-queue-metrics @ 1fd838ddcc3274448908227c285d1b7f233207d9 -- developer-run-tests: #3712

Adds
  • 2 second max duration a "lock container" request can spend in the queue (at that point it returns 503) -- configurable as API.MaxQueueTimeForLockRequests
  • adds queue_delay_seconds (time spent in queue before being accepted to a processing slot), summary for each of low/normal/high priority
  • adds queue_timeout_seconds (time spent in queue before being abandoned by client or timed out by rule), summary for each of low/normal/high priority
  • splits queued_requests into 3 gauges for low/normal/high priority.

This looks good, let's merge it and run some more scale tests.

Actions #31

Updated by Peter Amstutz about 1 year ago

  • Status changed from In Progress to Resolved

In the course of testing this I hit https://dev.arvados.org/issues/20667

But I think this ticket is doing everything it was supposed to do.

Actions #32

Updated by Peter Amstutz 11 months ago

  • Release set to 66
Actions

Also available in: Atom PDF