Bug #13933

crunch-dispatch-slurm / Go SDK Dispatcher performs poorly in the presence of a large backlog

Added by Joshua Randall almost 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 11/13/2018
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release:
Release relationship: Auto

Description

When there is a large backlog of queued containers, crunch-dispatch-slurm takes a long time to process them, and can be very sensitive to (even transient) API server issues.

For example, we currently have ~37000 containers in state Queued with priority > 0. crunch-dispatch-slurm requests these from the API server in batches of 100, and on each pass through the loop it also wastefully asks for a count of items available, a query that costs the database roughly as much as the list request itself. The API server additionally makes a pre-flight check of the size of the mounts field in the container records, so fulfilling each batch of 100 queries the database three times with the same conditions (but different select values), at roughly 3x the time cost. Changing the dispatcher code so that it issues a single `limit: 0` count request at the beginning and then `count: none` requests on each loop iteration improves performance significantly. Changing the API server so that it skips the mounts-size check when the limit is already at the minimum (100 seems to be the minimum?) could yield an additional 50% speedup on these queries.

If any one of the (in our case ~370) batched list requests to the API server fails for any reason, crunch-dispatch-slurm (really the Go SDK Dispatcher) gives up and starts over from the beginning (N.B. it does not even log a warning in this situation). The code path is that checkForUpdates returns false at https://github.com/curoverse/arvados/blob/master/sdk/go/dispatch/dispatch.go#L172, which then triggers the `if !querySuccess { continue }` block at either https://github.com/curoverse/arvados/blob/master/sdk/go/dispatch/dispatch.go#L100 or https://github.com/curoverse/arvados/blob/master/sdk/go/dispatch/dispatch.go#L128. In an environment with a large backlog and a nonzero API server error rate, this makes it difficult to ever reach the later stages of the Run() function. I don't have a solution to suggest, but at a minimum it would help if both of those continue blocks logged a message indicating that not all containers were retrieved from the API server successfully, so that operators have a chance to notice the problem.


Subtasks

Task #14052: Review (Closed, Lucas Di Pentima)

Associated revisions

Revision 70e5c7a3
Added by Tom Clegg almost 2 years ago

Merge branch '13933-dispatch-batch-size'

closes #13933

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision d7b63b69 (diff)
Added by Tom Clegg almost 2 years ago

13933: Update error message expectation.

refs #13933

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Joshua Randall almost 2 years ago

  • Category set to Crunch

After implementing the initial count check with `limit: 0` (which currently takes ~10s on our system) and `count: none` on each subsequent loop iteration, each batch of 100 takes ~18s to come back (~5.5 containers per second). Before the `count: none` fix, each batch of 100 took nearly 30s.

Changing the loop to use `limit: 1000` instead of the default 100 results in each batch of 1000 taking ~22s (so, ~45.5 per second).

Changing the loop to use `limit: 10000` results in each batch of 10000 taking ~46s to come back (so, ~217 per second).

I cannot test higher than this, as our backlog cleared pretty quickly when running with limit 10000. Clearing it would have taken about two hours longer with the default limit of 100; in fact, under our current conditions it would never clear, because our system is submitting containers faster than c-d-s can process them in batches of 100.

I would suggest making the batch size in c-d-s configurable, and/or using a larger default.

#2 Updated by Joshua Randall almost 2 years ago

Incidentally, with batch size 10000, the processing time to clear our queued container backlog broke down as:

  • 10s: get count of items available matching filters (fixed cost)
  • 46s: get batch of 10000 containers (~217 per second)
  • 1398s: lock and submit 10000 containers (~7 per second)
  • 48s: get batch of 10000 containers (~208 per second)
  • 1032s: lock and submit 10000 containers (~10 per second)
  • (followed by smaller batches)

Overall performance of c-d-s could potentially be further improved by having multiple worker goroutines handle the locking and starting of containers concurrently.

#3 Updated by Tom Morris almost 2 years ago

  • Target version set to To Be Groomed

#5 Updated by Tom Morris almost 2 years ago

  • Target version changed from To Be Groomed to 2018-09-05 Sprint

#6 Updated by Tom Clegg almost 2 years ago

  • Assigned To set to Tom Clegg

#7 Updated by Tom Clegg almost 2 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

#8 Updated by Ward Vandewege almost 2 years ago

  • Release set to 13
