Feature #20601

closed

Optimize arvados-dispatch-cloud performance when queue is very large

Added by Tom Clegg over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Story points: -
Release relationship: Auto

Description

Possible improvements include
  • When fetching queued containers, sort by priority, and stop when reaching expected capacity
  • Un-select mounts field when re-fetching a container whose mounts field we already have
  • Re-fetch container records less frequently when busy (but still frequently check for new queued containers if we have capacity to run them)
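A minimal sketch of the first idea (fetch in priority order and stop once expected capacity is covered). This is not the dispatcher's actual code: the Container type, fetchQueued, and the fetchPage callback are hypothetical names used only for illustration, and fetchPage is assumed to stand in for an API list call ordered by priority, highest first.

```go
package main

import "fmt"

// Container is a pared-down, hypothetical stand-in for a queued container record.
type Container struct {
	UUID     string
	Priority int
}

// fetchQueued pages through a queue that is assumed to be sorted by priority
// (highest first) and stops as soon as it has seen `capacity` containers or
// the queue is exhausted. fetchPage is a hypothetical callback that returns
// the page starting at the given offset, or an empty slice when nothing is left.
func fetchQueued(fetchPage func(offset int) []Container, capacity int) []Container {
	var got []Container
	for len(got) < capacity {
		page := fetchPage(len(got))
		if len(page) == 0 {
			break // queue exhausted before reaching capacity
		}
		got = append(got, page...)
	}
	if len(got) > capacity {
		got = got[:capacity] // drop the surplus from the last page
	}
	return got
}

func main() {
	// A tiny in-memory queue, already sorted by priority, served two at a time.
	queue := []Container{{"a", 9}, {"b", 7}, {"c", 5}, {"d", 3}, {"e", 1}}
	calls := 0
	fetchPage := func(offset int) []Container {
		calls++
		end := offset + 2
		if end > len(queue) {
			end = len(queue)
		}
		return queue[offset:end]
	}
	got := fetchQueued(fetchPage, 3)
	fmt.Println("containers kept:", len(got), "page requests:", calls)
}
```

With a large queue, the number of page requests depends on capacity and page size rather than on the total queue length.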

Subtasks 1 (0 open, 1 closed)

Task #20628: Review 20601-big-ctr-queue, Resolved, Tom Clegg, 06/13/2023
Actions #1

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Future to Development 2023-06-21 sprint
Actions #2

Updated by Peter Amstutz over 1 year ago

  • Assigned To set to Tom Clegg
Actions #3

Updated by Tom Clegg over 1 year ago

  • Status changed from New to In Progress
Actions #5

Updated by Tom Clegg over 1 year ago

20601-big-ctr-queue @ b6103b727f2ffeae0d9a4702c158842df71a2128 -- developer-run-tests: #3699

In order of expected impact, we have:
  • Don't fetch all the "mounts" fields on every poll. Unselect when getting lists, then do one additional API call to get that field for each new container. This increases the total number of API calls, but makes the majority of the calls significantly cheaper when the mounts field is large.
  • When there are thousands of queued containers, don't fetch multiple pages -- just one page of the highest priority new containers. (More precisely: only fetch enough pages [which may have as many as 1000 containers, it's up to RailsAPI] to see the 100 highest priority queued non-supervisor containers.)
  • Between queue polls, wait at least PollInterval and at least as long as the last poll took to complete. This avoids saturating RailsAPI with back-to-back requests when fetching the queue is slow due to either size or server load.
Actions #6

Updated by Tom Clegg over 1 year ago

20601-big-ctr-queue @ b312e657331120960dafb2e0b536f560e402486f

just adds a comment.

Actions #7

Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote in #note-6:

20601-big-ctr-queue @ b312e657331120960dafb2e0b536f560e402486f

just adds a comment.

Is there a situation where a running container priority goes to 0 but it never notices because it is now always going to be at the end of the list?

... I guess you answered that in the last commit where it explicitly requests the states of "missing" containers.

What happens when the dispatcher restarts? Does it get container uuids from the running nodes and then look them up?

Otherwise, this LGTM.

Actions #8

Updated by Tom Clegg over 1 year ago

Peter Amstutz wrote in #note-7:

Is there a situation where a running container priority goes to 0 but it never notices because it is now always going to be at the end of the list?

... I guess you answered that in the last commit where it explicitly requests the states of "missing" containers.

Yes, exactly.

What happens when the dispatcher restarts? Does it get container uuids from the running nodes and then look them up?

If a crunch-run process is already running when dispatcher starts up, its container must already be in Locked state (or further along), so the "locked by me" or "missing" queries will pick it up.

Actions #9

Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote in #note-8:

If a crunch-run process is already running when dispatcher starts up, its container must already be in Locked state (or further along), so the "locked by me" or "missing" queries will pick it up.

Great.

A new bug I noticed. This seems as good a branch as any to fix it.

maxSupervisors is calculated one time as MaxInstances * SupervisorFraction.

The problem is, if maxConcurrency gets adjusted downward, we could get into a situation where maxSupervisors > maxConcurrency, which would result in starvation.

So, I think runQueue() needs to be tweaked to recalculate maxSupervisors based on min(maxInstances, maxConcurrency) * SupervisorFraction.
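A rough sketch of the proposed recalculation, for illustration only. supervisorLimit is a hypothetical function name, truncating to an integer is an assumption about rounding, and special cases (such as a limit of zero meaning "unlimited") are not handled here.

```go
package main

import "fmt"

// supervisorLimit computes the supervisor cap from the smaller of the two
// limits, so the cap shrinks when maxConcurrency is adjusted downward.
// Hypothetical helper; truncation toward zero is an assumed rounding choice.
func supervisorLimit(maxInstances, maxConcurrency int, supervisorFraction float64) int {
	limit := maxInstances
	if maxConcurrency < limit {
		limit = maxConcurrency
	}
	return int(float64(limit) * supervisorFraction)
}

func main() {
	// Computed once from MaxInstances only: the cap stays at 25 even if
	// concurrency has been throttled to 20.
	fmt.Println(int(float64(100) * 0.25))

	// Recomputed on each pass from min(maxInstances, maxConcurrency).
	fmt.Println(supervisorLimit(100, 20, 0.25))
}
```

Recomputing on every runQueue() pass keeps the supervisor cap below the current concurrency limit, which is what prevents supervisors from taking every available slot.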

Actions #11

Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote in #note-10:

20601-supervisor-fraction @ 69148e44f11b50f1c4b1c1bc4c1871ab10c3e893 -- developer-run-tests: #3704

Scheduler.maxSupervisors is a ratio but then the variable maxSupervisors in runQueue() is a count. This is confusing.

Scheduler.maxSupervisors should be renamed Scheduler.supervisorFraction

Rest LGTM.

Actions #12

Updated by Peter Amstutz over 1 year ago

  • Status changed from In Progress to Resolved

I made the change and merged it (9075ef17087f7b3bdce308bfa5d60b4fe3863b51), so this is resolved, thanks.

Actions #13

Updated by Peter Amstutz over 1 year ago

  • Release set to 66