Bug #13933
closedcrunch-dispatch-slurm / Go SDK Dispatcher performs poorly in the presence of a large backlog
Description
When there is a large backlog of queued containers, crunch-dispatch-slurm takes a long time to process them, and can be very sensitive to (even transient) API server issues.
For example, we currently have ~37000 containers in state Queued with priority > 0. crunch-dispatch-slurm requests these from the API server in batches of 100 (also each time through the loop wastefully asking for a count of items available for which the database takes around the same amount of time as the request itself). The API server also makes a pre-flight check of the size of the mounts in the container records, so to fulfill each one of these batches of 100 the database gets queried three times with the same conditions (but different select values) and at ~3x the time cost. Changing the dispatcher code so that it does a `limit: 0` count request at the beginning and then does `count: none` requests in each loop iteration improves performance significantly. Changing the API server so that it does not check the size of the mounts fields when the limits are already at the minimum (100 seems to be the minimum?) could yield an additional 50% speedup on these queries.
If any one of the (in our case ~370) batch list requests to the API server fails for any reason, crunch-dispatch-slurm (really the Go SDK Dispatcher) gives up and starts again from the beginning (N.B. it doesn't even log a warning in this situation for some reason). The code path here is that checkForUpdates returns false at https://github.com/curoverse/arvados/blob/master/sdk/go/dispatch/dispatch.go#L172 which then triggers the `if !querySuccess { continue }` block at either https://github.com/curoverse/arvados/blob/master/sdk/go/dispatch/dispatch.go#L100 or https://github.com/curoverse/arvados/blob/master/sdk/go/dispatch/dispatch.go#L128. In an environment with a large backlog and a nonzero API server error rate, this makes it difficult to reach the later stages of the Run() function. I don't have a solution to suggest for this, but I think it would be helpful at a minimum if both of those continue blocks logged a message indicating that not all containers were retrieved from the API server successfully so that operators have a chance to notice the problem.
Updated by Joshua Randall over 6 years ago
- Category set to Crunch
After implementing the initial count check with `limit: 0` (which takes ~10s on our system at present) and subsequent `count: none` on each loop iteration, on our system each batch of 100 is taking ~18s to come back (so, ~5.5 per second). Prior to the `count: none` fix, it was taking nearly 30s per batch of 100.
Changing the loop to use `limit: 1000` instead of the default 100 results in each batch of 1000 taking ~22s (so, ~45.5 per second).
Changing the loop to use `limit: 10000` results in each batch of 10000 taking ~46s to come back (so, ~217 per second).
I cannot test higher than this as our backlog was cleared pretty quickly when running with limit 10000 (this would have taken two hours longer to clear with the default 100 limit - although in actuality under our current conditions it would never clear the backlog as our system is submitting containers faster than it is possible for c-d-s to process them in batches of 100).
I would suggest making the batch size in c-d-s configurable, and/or using a larger default.
Updated by Joshua Randall over 6 years ago
Incidentally, with batch size 10000 the processing time to clear our queued container backlog was:
10s get items available matching filters (fixed)
46s get batch of 10000 containers (~217 per second)
1398s lock and submit 10000 containers (~7 per second)
48s get batch of 10000 containers (~208 per second)
1032s lock and submit 10000 containers (~10 per second)
(followed by small batches)
Overall performance of c-d-s could potentially be further improved by having multiple worker goroutines handle the locking and starting of containers concurrently.
Updated by Joshua Randall over 6 years ago
Updated by Tom Morris over 6 years ago
- Target version changed from To Be Groomed to 2018-09-05 Sprint
Updated by Tom Clegg over 6 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|70e5c7a3c6a5860d702d5e5c219dc0f3a3696d35.