Feature #20601
closed
Optimize arvados-dispatch-cloud performance when queue is very large
Added by Tom Clegg over 1 year ago.
Updated over 1 year ago.
Release relationship: Auto
Description
Possible improvements include
- When fetching queued containers, sort by priority, and stop when reaching expected capacity (see the sketch after this list)
- Un-select the mounts field when re-fetching a container whose mounts field we already have
- Re-fetch container records less frequently when busy (but still frequently check for new queued containers if we have capacity to run them)
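For illustration, a minimal sketch of that kind of queue poll using the Arvados Go SDK, assuming a single priority-ordered page that leaves out the large mounts field; the filters, limit, and select list here are placeholders rather than the dispatcher's exact request:

```go
package main

import (
	"log"

	"git.arvados.org/arvados.git/sdk/go/arvados"
)

func main() {
	client := arvados.NewClientFromEnv()

	// Fetch one page of the highest-priority queued containers, skipping the
	// potentially very large "mounts" field and the expensive row count.
	var page arvados.ContainerList
	err := client.RequestAndDecode(&page, "GET", "arvados/v1/containers", nil, map[string]interface{}{
		"filters": [][]interface{}{{"state", "=", "Queued"}, {"priority", ">", 0}},
		"order":   []string{"priority desc", "created_at asc"},
		"limit":   100, // roughly the capacity we expect to have, not the whole queue
		"select":  []string{"uuid", "state", "priority", "created_at", "runtime_constraints", "scheduling_parameters"},
		"count":   "none",
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, ctr := range page.Items {
		log.Printf("queued: %s (priority %d)", ctr.UUID, ctr.Priority)
	}
}
```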
- Target version changed from Future to Development 2023-06-21 sprint
- Assigned To set to Tom Clegg
- Status changed from New to In Progress
20601-big-ctr-queue @ b6103b727f2ffeae0d9a4702c158842df71a2128 -- developer-run-tests: #3699
In order of expected impact, we have:
- Don't fetch all the "mounts" fields on every poll. Unselect when getting lists, then do one additional API call to get that field for each new container. This increases the total number of API calls, but makes the majority of the calls significantly cheaper when the mounts field is large.
- When there are thousands of queued containers, don't fetch multiple pages -- just one page of the highest priority new containers. (More precisely: only fetch enough pages [which may have as many as 1000 containers, it's up to RailsAPI] to see the 100 highest priority queued non-supervisor containers.)
- Between queue polls, wait at least PollInterval and at least as long as the last poll took to complete. This avoids saturating RailsAPI with back-to-back requests when fetching the queue is slow due to either size or server load.
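Roughly, the mounts change and the pacing change could look like the sketch below; fetchMounts and pollLoop are hypothetical helpers (the real dispatcher's queue code is structured differently), and the poll callback stands in for the single-page, priority-ordered list request described above:

```go
package sketch

import (
	"context"
	"time"

	"git.arvados.org/arvados.git/sdk/go/arvados"
)

// fetchMounts (hypothetical helper) retrieves only the mounts field of one
// container, so the large field is transferred once per new container instead
// of on every queue poll.
func fetchMounts(client *arvados.Client, uuid string) (map[string]arvados.Mount, error) {
	var ctr arvados.Container
	err := client.RequestAndDecode(&ctr, "GET", "arvados/v1/containers/"+uuid, nil, map[string]interface{}{
		"select": []string{"uuid", "mounts"},
	})
	return ctr.Mounts, err
}

// pollLoop paces queue polls: wait at least pollInterval between polls, and at
// least as long as the previous poll took, so a slow RailsAPI never sees
// back-to-back queue requests from the dispatcher.
func pollLoop(ctx context.Context, pollInterval time.Duration, poll func()) {
	for ctx.Err() == nil {
		start := time.Now()
		poll() // e.g. the single-page, priority-ordered list request
		wait := pollInterval
		if elapsed := time.Since(start); elapsed > wait {
			wait = elapsed
		}
		select {
		case <-ctx.Done():
		case <-time.After(wait):
		}
	}
}
```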
Tom Clegg wrote in #note-6:
20601-big-ctr-queue @ b312e657331120960dafb2e0b536f560e402486f just adds a comment.
Is there a situation where a running container's priority goes to 0 but the dispatcher never notices, because that container is now always at the end of the list?
... I guess you answered that in the last commit where it explicitly requests the states of "missing" containers.
What happens when the dispatcher restarts? Does it get container uuids from the running nodes and then look them up?
Otherwise, this LGTM.
Peter Amstutz wrote in #note-7:
Is there a situation where a running container's priority goes to 0 but the dispatcher never notices, because that container is now always at the end of the list?
... I guess you answered that in the last commit where it explicitly requests the states of "missing" containers.
Yes, exactly.
What happens when the dispatcher restarts? Does it get container uuids from the running nodes and then look them up?
If a crunch-run process is already running when dispatcher starts up, its container must already be in Locked state (or further along), so the "locked by me" or "missing" queries will pick it up.
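For illustration, those two queries could take roughly this shape; refreshLockedAndMissing, myAuthUUID, and missingUUIDs are made-up names, and the dispatcher's actual queue code is organized differently:

```go
package sketch

import "git.arvados.org/arvados.git/sdk/go/arvados"

// refreshLockedAndMissing is a sketch of the two safety-net queries.
// myAuthUUID is the UUID of the dispatcher's own API token authorization;
// missingUUIDs are containers reported by crunch-run processes that did not
// appear in the latest priority-ordered queue page.
func refreshLockedAndMissing(client *arvados.Client, myAuthUUID string, missingUUIDs []string) ([]arvados.Container, error) {
	var out []arvados.Container
	for _, filters := range [][][]interface{}{
		// "locked by me": containers this dispatcher has already locked,
		// wherever they sort in the priority-ordered page.
		{{"locked_by_uuid", "=", myAuthUUID}},
		// "missing": re-check the state of containers we are running but no
		// longer see in the queue (e.g. priority dropped to 0).
		{{"uuid", "in", missingUUIDs}},
	} {
		var page arvados.ContainerList
		err := client.RequestAndDecode(&page, "GET", "arvados/v1/containers", nil, map[string]interface{}{
			"filters": filters,
			"select":  []string{"uuid", "state", "priority"},
			"count":   "none",
		})
		if err != nil {
			return nil, err
		}
		out = append(out, page.Items...)
	}
	return out, nil
}
```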
Tom Clegg wrote in #note-8:
If a crunch-run process is already running when dispatcher starts up, its container must already be in Locked state (or further along), so the "locked by me" or "missing" queries will pick it up.
Great.
A new bug I noticed. This seems as good a branch as any to fix it.
maxSupervisors is calculated one time as MaxInstances * SupervisorFraction.
The problem is, if maxConcurrency gets adjusted downward, we could get into a situation where maxSupervisors > maxConcurrency, which would result in starvation.
So, I think runQueue() needs to be tweaked to recalculate maxSupervisors based on min(maxInstances, maxConcurrency) * SupervisorFraction.
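A sketch of that recalculation, with simplified names (maxInstances, maxConcurrency, and supervisorFraction stand in for the scheduler's actual fields and config values):

```go
package sketch

// maxSupervisorsFor returns how many supervisor containers (e.g. workflow
// runners) may run at once. Deriving it from the effective concurrency limit,
// min(maxInstances, maxConcurrency), avoids the starvation case where
// maxConcurrency has been adjusted downward but the supervisor limit still
// reflects MaxInstances, leaving no room for non-supervisor containers.
func maxSupervisorsFor(maxInstances, maxConcurrency int, supervisorFraction float64) int {
	limit := maxInstances
	if maxConcurrency > 0 && maxConcurrency < limit {
		limit = maxConcurrency
	}
	return int(float64(limit) * supervisorFraction)
}
```

Calling something like this from runQueue() on every scheduling pass, instead of computing the limit once at startup, keeps the supervisor cap consistent with whatever concurrency limit is currently in effect.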
- Status changed from In Progress to Resolved