Feature #8457

open

[Keep] Shuffle top N keep servers to balance reads

Added by Peter Amstutz almost 9 years ago. Updated 10 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: Keep
Target version:
Story points: -
Release:
Release relationship: Auto

Description

Currently, the Keep client always orders the server list the same way for a given block. If compute nodes all request the same file at once (a common event when starting a large run), the load for each block concentrates on the keepstore at the top of that block's list, while other servers holding the same block sit idle.

Since blocks are typically replicated, we could shuffle the top N services, where N is the greater of the replication count for the block and the number of Keep readers we're willing to run simultaneously. In a properly replicated and balanced cluster this spreads out the load, because different clients will request blocks in slightly different priority orders.
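A minimal sketch of the shuffle described above, in Python. This is not code from any Arvados SDK: the function name `shuffle_top_n` and its parameters are hypothetical, and it assumes the caller already has the per-block ordered service list, the block's replication count, and the reader limit.

```python
import random


def shuffle_top_n(ordered_services, replication, max_readers, rng=random):
    """Return a copy of ordered_services with its first N entries shuffled.

    N is the greater of the block's replication count and the number of
    simultaneous readers, capped at the length of the list. Entries after
    the first N keep their original order. (Hypothetical helper, for
    illustration only.)
    """
    n = min(len(ordered_services), max(replication, max_readers))
    head = list(ordered_services[:n])
    rng.shuffle(head)  # in-place shuffle of the copied head only
    return head + list(ordered_services[n:])


# Example: with replication 2 and 3 readers, only the first 3 entries move.
services = ["keep0", "keep1", "keep2", "keep3", "keep4"]
probe_order = shuffle_top_n(services, replication=2, max_readers=3)
```

Because the tail of the list is left alone, a client still falls back to the remaining services in their usual order if none of the shuffled candidates has the block.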

Each SDK should have a function that returns "the maximum number of simultaneous workers, based on the desired replication level and the characteristics of the underlying Keep services." (The Python SDK has this code inside ThreadLimiter.__init__; it can be refactored out independently.) The result of that function should also determine how many services to shuffle for this story.

Actions #1

Updated by Peter Amstutz almost 9 years ago

  • Description updated (diff)
Actions #2

Updated by Brett Smith almost 9 years ago

If we decide to do this, it should probably be split into separate stories for the Go and Python SDKs. The code change seems pretty straightforward, but there will definitely be test impacts.

Actions #3

Updated by Brett Smith almost 9 years ago

  • Description updated (diff)
  • Category set to Keep

Updated to account for non-disk services. In deployments where the underlying storage volume, rather than Keep itself, handles most of the replication, shuffling based on the replication level alone is likely to hurt performance by generating requests that are unlikely to succeed.

Actions #4

Updated by Peter Amstutz almost 2 years ago

  • Release set to 60
Actions #5

Updated by Peter Amstutz 10 months ago

  • Target version set to Future