Feature #20601 (Closed)
Optimize arvados-dispatch-cloud performance when queue is very large
Description
- When fetching queued containers, sort by priority, and stop when reaching expected capacity
- Un-select the mounts field when re-fetching a container whose mounts field we already have (see the sketch after this list)
- Re-fetch container records less frequently when busy (but still frequently check for new queued containers if we have capacity to run them)
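A minimal sketch of the mounts-related item above. It assumes the Go SDK's arvados.Client.RequestAndDecode, arvados.ResourceListParams, arvados.Filter, and arvados.ContainerList types (field names from memory, so treat them as assumptions); the real dispatcher queue code is structured differently, this only illustrates listing without mounts and back-filling that one field per new container.

```go
package dispatchsketch

import "git.arvados.org/arvados.git/sdk/go/arvados"

// Illustrative only -- not the actual arvados-dispatch-cloud queue code.
// listWithoutMounts fetches one page of queued containers with the
// (potentially huge) mounts field un-selected, then back-fills mounts with
// one extra GET per container we have not seen before.
// known maps container UUID -> previously fetched record (which has mounts).
func listWithoutMounts(client *arvados.Client, known map[string]arvados.Container) ([]arvados.Container, error) {
	limit := 100
	var page arvados.ContainerList
	err := client.RequestAndDecode(&page, "GET", "arvados/v1/containers", nil, arvados.ResourceListParams{
		// "mounts" is deliberately omitted from the select list.
		Select:  []string{"uuid", "state", "priority", "created_at"},
		Filters: []arvados.Filter{{"state", "=", "Queued"}, {"priority", ">", 0}},
		Order:   "priority desc",
		Limit:   &limit,
	})
	if err != nil {
		return nil, err
	}
	for i, ctr := range page.Items {
		if prev, ok := known[ctr.UUID]; ok {
			// Already have this container's mounts from an earlier poll.
			page.Items[i].Mounts = prev.Mounts
			continue
		}
		// One additional, but individually cheap, API call per new container.
		var full arvados.Container
		err := client.RequestAndDecode(&full, "GET", "arvados/v1/containers/"+ctr.UUID, nil, map[string]interface{}{
			"select": []string{"uuid", "mounts"},
		})
		if err != nil {
			return nil, err
		}
		page.Items[i].Mounts = full.Mounts
	}
	return page.Items, nil
}
```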
Updated by Peter Amstutz over 1 year ago
- Target version changed from Future to Development 2023-06-21 sprint
Updated by Tom Clegg over 1 year ago
20601-big-ctr-queue @ 61b45d1413d21a4f6a0fd6449680f0d03eac1d88 -- developer-run-tests: #3697
Updated by Tom Clegg over 1 year ago
20601-big-ctr-queue @ b6103b727f2ffeae0d9a4702c158842df71a2128 -- developer-run-tests: #3699
In order of expected impact, we have:
- Don't fetch all the "mounts" fields on every poll. Unselect when getting lists, then do one additional API call to get that field for each new container. This increases the total number of API calls, but makes the majority of the calls significantly cheaper when the mounts field is large.
- When there are thousands of queued containers, don't fetch multiple pages -- just one page of the highest priority new containers. (More precisely: only fetch enough pages [which may have as many as 1000 containers, it's up to RailsAPI] to see the 100 highest priority queued non-supervisor containers.)
- Between queue polls, wait at least PollInterval and at least as long as the last poll took to complete. This avoids saturating RailsAPI with back-to-back requests when fetching the queue is slow due to either size or server load.
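A small sketch of that last point (the inter-poll wait); pollLoop and updateQueue are placeholder names for this illustration, not identifiers from the dispatcher source.

```go
package dispatchsketch

import (
	"context"
	"time"
)

// pollLoop waits between queue polls for at least pollInterval and at least
// as long as the previous poll took, so a slow RailsAPI never gets
// back-to-back queue requests from the dispatcher.
func pollLoop(ctx context.Context, pollInterval time.Duration, updateQueue func(context.Context) error) {
	for ctx.Err() == nil {
		start := time.Now()
		_ = updateQueue(ctx) // errors would be logged and retried on the next poll
		wait := pollInterval
		if elapsed := time.Since(start); elapsed > wait {
			// The last poll was slow (big queue or loaded server), so wait
			// at least that long before starting the next one.
			wait = elapsed
		}
		select {
		case <-time.After(wait):
		case <-ctx.Done():
		}
	}
}
```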
Updated by Tom Clegg over 1 year ago
20601-big-ctr-queue @ b312e657331120960dafb2e0b536f560e402486f
just adds a comment.
Updated by Peter Amstutz over 1 year ago
Tom Clegg wrote in #note-6:
20601-big-ctr-queue @ b312e657331120960dafb2e0b536f560e402486f
just adds a comment.
Is there a situation where a running container's priority goes to 0 but the dispatcher never notices, because that container is now always going to be at the end of the list?
... I guess you answered that in the last commit where it explicitly requests the states of "missing" containers.
What happens when the dispatcher restarts? Does it get container uuids from the running nodes and then look them up?
Otherwise, this LGTM.
Updated by Tom Clegg over 1 year ago
Peter Amstutz wrote in #note-7:
Is there a situation where a running container's priority goes to 0 but the dispatcher never notices, because that container is now always going to be at the end of the list?
... I guess you answered that in the last commit where it explicitly requests the states of "missing" containers.
Yes, exactly.
What happens when the dispatcher restarts? Does it get container uuids from the running nodes and then look them up?
If a crunch-run process is already running when dispatcher starts up, its container must already be in Locked state (or further along), so the "locked by me" or "missing" queries will pick it up.
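For the record, a rough sketch of what those two recovery lookups could look like; the filters and the dispatcherAuthUUID parameter are illustrative guesses at the idea described above, not code copied from the dispatcher.

```go
package dispatchsketch

import "git.arvados.org/arvados.git/sdk/go/arvados"

// recoverAfterRestart illustrates the two lookups discussed above: containers
// locked by this dispatcher's token, plus an explicit fetch of any "missing"
// UUIDs that running crunch-run processes report but the priority-sorted
// queue page did not include.
func recoverAfterRestart(client *arvados.Client, dispatcherAuthUUID string, reportedUUIDs []string) ([]arvados.Container, error) {
	sel := []string{"uuid", "state", "priority"}

	// "Locked by me": anything this dispatcher had locked before restarting.
	var locked arvados.ContainerList
	err := client.RequestAndDecode(&locked, "GET", "arvados/v1/containers", nil, arvados.ResourceListParams{
		Select:  sel,
		Filters: []arvados.Filter{{"locked_by_uuid", "=", dispatcherAuthUUID}},
	})
	if err != nil {
		return nil, err
	}
	recovered := locked.Items

	// "Missing": containers reported by running crunch-run processes that are
	// not in the queue snapshot; request their current state by UUID.
	if len(reportedUUIDs) > 0 {
		var missing arvados.ContainerList
		err = client.RequestAndDecode(&missing, "GET", "arvados/v1/containers", nil, arvados.ResourceListParams{
			Select:  sel,
			Filters: []arvados.Filter{{"uuid", "in", reportedUUIDs}},
		})
		if err != nil {
			return nil, err
		}
		recovered = append(recovered, missing.Items...)
	}
	return recovered, nil
}
```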
Updated by Peter Amstutz over 1 year ago
Tom Clegg wrote in #note-8:
If a crunch-run process is already running when dispatcher starts up, its container must already be in Locked state (or further along), so the "locked by me" or "missing" queries will pick it up.
Great.
A new bug I noticed. This seems as good a branch as any to fix it.
maxSupervisors is calculated one time as MaxInstances * SupervisorFraction.
The problem is, if maxConcurrency gets adjusted downward, we could get into a situation where maxSupervisors > maxConcurrency, which would result in starvation.
So, I think runQueue() needs to be tweaked to recalculate maxSupervisors based on min(maxInstances, maxConcurrency) * SupervisorFraction.
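Something along these lines, where maxInstances, maxConcurrency, and the SupervisorFraction ratio are assumed to be available inside runQueue(); the surrounding scheduler code is elided and treating maxConcurrency == 0 as "no dynamic limit" is an assumption of this sketch.

```go
package dispatchsketch

// recalcMaxSupervisors derives the supervisor cap from whichever of
// maxInstances / maxConcurrency is currently the effective limit, so that
// lowering maxConcurrency can never leave maxSupervisors above it.
// Treating maxConcurrency == 0 as "no dynamic limit" is an assumption of
// this sketch, not necessarily how the real scheduler represents it.
func recalcMaxSupervisors(maxInstances, maxConcurrency int, supervisorFraction float64) int {
	limit := maxInstances
	if maxConcurrency > 0 && maxConcurrency < limit {
		limit = maxConcurrency
	}
	return int(float64(limit) * supervisorFraction)
}
```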
Updated by Tom Clegg over 1 year ago
20601-supervisor-fraction @ 69148e44f11b50f1c4b1c1bc4c1871ab10c3e893 -- developer-run-tests: #3704
Updated by Peter Amstutz over 1 year ago
Tom Clegg wrote in #note-10:
20601-supervisor-fraction @ 69148e44f11b50f1c4b1c1bc4c1871ab10c3e893 -- developer-run-tests: #3704
Scheduler.maxSupervisors is a ratio, but the variable maxSupervisors in runQueue() is a count. This is confusing.
Scheduler.maxSupervisors should be renamed Scheduler.supervisorFraction.
Rest LGTM.
Updated by Peter Amstutz over 1 year ago
- Status changed from In Progress to Resolved
I made the change and merged it (9075ef17087f7b3bdce308bfa5d60b4fe3863b51), so this is resolved, thanks.