Idea #7931
closed
[keep-balance] Count block replication by volume IDs
Added by Brett Smith about 9 years ago. Updated over 6 years ago.
Description
Use the volume information added in #7928. Count the number of times a block is replicated for each volume it's stored on, rather than each Keepstore that reports it in the index.
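As a rough illustration of the difference, here is a minimal Go sketch. The `Replica` type, its field names, and both counting functions are hypothetical and are not taken from the keep-balance source; the sketch only assumes that each index entry can be tied to an identifier for the underlying volume.

```go
package main

import "fmt"

// Replica is a hypothetical record of one index entry for a block.
type Replica struct {
	ServiceUUID string // keepstore that reported the block
	DeviceID    string // underlying volume; shared when two services mount the same backend
	Mtime       int64
}

// countByService counts one replica per keepstore that reports the block.
func countByService(replicas []Replica) int {
	seen := map[string]bool{}
	for _, r := range replicas {
		seen[r.ServiceUUID] = true
	}
	return len(seen)
}

// countByVolume counts one replica per distinct underlying volume.
func countByVolume(replicas []Replica) int {
	seen := map[string]bool{}
	for _, r := range replicas {
		seen[r.DeviceID] = true
	}
	return len(seen)
}

func main() {
	// Two services report the same block from the same underlying volume.
	replicas := []Replica{
		{ServiceUUID: "svc-a", DeviceID: "vol-1", Mtime: 100},
		{ServiceUUID: "svc-b", DeviceID: "vol-1", Mtime: 100},
	}
	fmt.Println("by service:", countByService(replicas))
	fmt.Println("by volume:", countByVolume(replicas))
}
```

Counting by service sees two replicas here, while counting by volume sees only the one copy that actually exists.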
Updated by Tom Clegg over 7 years ago
- Subject changed from [Data Manager] Count block replication by volume IDs to [keep-balance] Count block replication by volume IDs
Updated by Tom Clegg about 7 years ago
- Related to Feature #11184: [Keep] Support multiple storage classes added
Updated by Tom Morris almost 7 years ago
- Assigned To set to Tom Clegg
- Target version changed from To Be Groomed to 2018-03-14 Sprint
Updated by Tom Clegg almost 7 years ago
- Target version changed from 2018-03-14 Sprint to 2018-03-28 Sprint
Updated by Tom Clegg almost 7 years ago
7931-replicas-by-volume @ 741acb186f89237011fd7bf371218246d7f85403
Updated by Peter Amstutz almost 7 years ago
I understand we currently run keep in a high availability configuration, where each service has a primary writable mount, and has a secondary read only mount which is the primary mount of some other keepstore. So I'm assuming we want keep-balance to do the right thing on this configuration.
Generally, this doesn't go all the way to identifying replicas by mount+pdh, and still relies on pdh+mtime logic.
- Instead of using "same mtime" to decide if two replicas are actually the same, it should be using the mount uuid
- When there are multiple replicas on the same service, replicas on read only mounts should have precedence over writable mounts (since only the writable replicas can be trashed).
- When deciding whether to trash or keep a given replica, it checks read-only on a per-service basis, but not a per-mount basis.
- Trash and pull requests don't (can't?) target a particular mount (trashing targets the block by mtime). This seems like it would be useful for moving blocks between mounts of different storage classes.
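The read-only points above could be sketched roughly as follows. This is a simplified illustration, not the keep-balance algorithm: the `Mount` and `Replica` types and the `planTrash` function are hypothetical, and rendezvous ordering and mtime safety checks are deliberately left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Mount is a hypothetical description of one keepstore mount.
type Mount struct {
	UUID     string
	ReadOnly bool
}

// Replica is a hypothetical record of a block stored on a mount.
type Replica struct {
	Mount Mount
	Mtime int64
}

// planTrash keeps up to `want` replicas, preferring those on read-only
// mounts, and returns the mount UUIDs of the surplus writable replicas.
// Read-only status is checked per mount, not per service.
func planTrash(replicas []Replica, want int) []string {
	sorted := append([]Replica(nil), replicas...)
	// Read-only replicas sort first so they are the ones kept.
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Mount.ReadOnly && !sorted[j].Mount.ReadOnly
	})
	var trash []string
	for i, r := range sorted {
		if i < want || r.Mount.ReadOnly {
			continue // kept, or cannot be trashed on this mount
		}
		trash = append(trash, r.Mount.UUID)
	}
	return trash
}

func main() {
	replicas := []Replica{
		{Mount: Mount{UUID: "m1", ReadOnly: false}, Mtime: 100},
		{Mount: Mount{UUID: "m2", ReadOnly: true}, Mtime: 100},
		{Mount: Mount{UUID: "m3", ReadOnly: false}, Mtime: 100},
	}
	fmt.Println(planTrash(replicas, 1))
}
```

With one replica wanted, the read-only copy on `m2` is kept and the writable copies on `m1` and `m3` become trash candidates.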
Updated by Peter Amstutz almost 7 years ago
Also, it doesn't seem to be taking the underlying Replication count of the mount into account? So if it is located in an object storage account with Replication=3, and the block has requested replication=2, it would still store it in two storage accounts?
Updated by Tom Clegg almost 7 years ago
The context here is #11184 (multiple storage classes). The objective is to move keep-balance to keepstore's mount-oriented index APIs (#11644) as a prerequisite for making decisions based on the mount information (#12708).
Trash and pull requests don't (can't?) target a particular mount (trashing targets the block by mtime). This seems like it would be useful for moving blocks between mounts of different storage classes
Pull requests can (#11644) and will (#12708) specify a target mount.
There is one improvement included here that isn't strictly necessary: 61f4a861f supports removing redundant replicas when a single server (which is one of the best in rendezvous order) has replicas on more than one volume. But this issue/branch probably shouldn't turn into "do things that become possible with mount-oriented APIs".
Updated by Tom Clegg almost 7 years ago
Peter Amstutz wrote:
- Instead of using "same mtime" to decide if two replicas are actually the same, it should be using the mount uuid
To be precise, mount UUIDs are always distinct. "Mtime is equal, but DeviceID is different" should make replicas safe to delete, now that we know DeviceID -- but there are many whatifs to consider and tests to write to make that kind of change safely, and it seems tangential to the issue at hand, so I'd rather not creep scope.
- When there are multiple replicas on the same service, replicas on read only mounts should have precedence over writable mounts (since only the writable replicas can be trashed).
- When deciding whether to trash or keep a given replica, it checks read-only on a per-service basis, but not a per-mount basis.
Addressed in 7931-replicas-by-volume @ 41e612b59ad85ee7f22ebf3239ec8ff1cbb463c5
Updated by Peter Amstutz over 6 years ago
Tom Clegg wrote:
Peter Amstutz wrote:
- Instead of using "same mtime" to decide if two replicas are actually the same, it should be using the mount uuid
To be precise, mount UUIDs are always distinct. "Mtime is equal, but DeviceID is different" should make replicas safe to delete, now that we know DeviceID -- but there are many whatifs to consider and tests to write to make that kind of change safely, and it seems tangential to the issue at hand, so I'd rather not creep scope.
Ok, I agree we should continue to use mtime as a predictor that two replicas are probably the same. Going the other way, is it possible that the same actual replica reports two different mtimes, because it is touched while getting the index of two servers that use the same backend? In that case, it would be safer to trash only if both the mtime and the DeviceID are different.
- When there are multiple replicas on the same service, replicas on read only mounts should have precedence over writable mounts (since only the writable replicas can be trashed).
- When deciding whether to trash or keep a given replica, it checks read-only on a per-service basis, but not a per-mount basis.
Addressed in 7931-replicas-by-volume @ 41e612b59ad85ee7f22ebf3239ec8ff1cbb463c5
Waiting for response on https://dev.arvados.org/issues/7931#note-9
Pull requests can (#11644) and will (#12708) specify a target mount.
What about trash requests? If one of the goals is to be able to move blocks to cheaper storage, we need to make sure they get deleted from the more expensive storage.
Clearly there is more work to come for per-mount balancing in #12708.
Updated by Tom Clegg over 6 years ago
Peter Amstutz wrote:
Ok, I agree we should continue to use mtime as a predictor that two replicas are probably the same. Going the other way, is it possible that the same actual replica reports two different mtimes, because it is touched while getting the index of two servers that use the same backend? In that case, it would be safer to trash only if both the mtime and the DeviceID are different.
This race is a problem if someone changes mtime to a different old timestamp. But when keepstore changes mtime it always changes it to now, and keep-balance doesn't delete replicas with recent mtime. If keep-balance chooses to delete the old copy, the timestamp won't match when keepstore processes its trash list, so nothing will happen.
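The last point can be illustrated with a small sketch. The `store` type, `TrashRequest`, and its field names are hypothetical stand-ins, not keepstore's actual types; the only behavior assumed is the one described above, namely that a trash request is a no-op when its timestamp no longer matches the stored block.

```go
package main

import "fmt"

// TrashRequest is a hypothetical trash list entry: the block locator plus
// the mtime that keep-balance saw in the index.
type TrashRequest struct {
	Locator    string
	BlockMtime int64
}

// store is a toy stand-in for a volume: locator -> current mtime.
type store map[string]int64

// touch sets the block's mtime to "now", as happens when a block is rewritten.
func (s store) touch(locator string, now int64) {
	s[locator] = now
}

// processTrash deletes the block only if its mtime still matches the request.
// It reports whether anything was deleted.
func (s store) processTrash(req TrashRequest) bool {
	cur, ok := s[req.Locator]
	if !ok || cur != req.BlockMtime {
		return false // block is missing or was touched since the index was read
	}
	delete(s, req.Locator)
	return true
}

func main() {
	s := store{"abc": 100}
	req := TrashRequest{Locator: "abc", BlockMtime: 100} // built from a stale index
	s.touch("abc", 5000)                                 // block touched in the meantime
	fmt.Println("deleted:", s.processTrash(req))
	_, stillThere := s["abc"]
	fmt.Println("still stored:", stillThere)
}
```

Because the touch moved the mtime away from the value in the request, the trash request does nothing and the block survives.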
Waiting for response on https://dev.arvados.org/issues/7931#note-9
Yes, that is the current behavior: desired=N currently has the effect "N × whatever the storage back-end provides". So far we have worked around this by setting default collection replication to 1.
Taking into account the replication level reported by keepstore is one of the improvements unblocked by (but not included in) this story.
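To make the "N × whatever the storage back-end provides" arithmetic concrete, here is a minimal sketch. The `Mount` type and both functions are hypothetical; `mountsNeeded` in particular shows one way a replication-aware count might work and is not something this branch implements.

```go
package main

import "fmt"

// Mount is a hypothetical mount with the replication level its back-end provides.
type Mount struct {
	UUID        string
	Replication int
}

// copiesWhenCountingMounts models the behavior described above: the block is
// placed on the first `desired` mounts, each counted as one replica, and the
// result is the total number of back-end copies that actually exist.
func copiesWhenCountingMounts(desired int, mounts []Mount) int {
	total := 0
	for i, m := range mounts {
		if i >= desired {
			break
		}
		total += m.Replication
	}
	return total
}

// mountsNeeded is a sketch of a replication-aware alternative: it returns how
// many mounts (taken in order) are needed before the cumulative back-end
// replication reaches `desired`. It returns len(mounts) if that never happens.
func mountsNeeded(desired int, mounts []Mount) int {
	total := 0
	for i, m := range mounts {
		if total >= desired {
			return i
		}
		total += m.Replication
	}
	return len(mounts)
}

func main() {
	mounts := []Mount{
		{UUID: "m1", Replication: 3},
		{UUID: "m2", Replication: 3},
		{UUID: "m3", Replication: 3},
	}
	fmt.Println("back-end copies with desired=2:", copiesWhenCountingMounts(2, mounts))
	fmt.Println("mounts needed with desired=2:", mountsNeeded(2, mounts))
}
```

With desired=2 and back-ends that each provide Replication=3, counting mounts yields six back-end copies, whereas a replication-aware count would be satisfied by a single mount.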
Pull requests can (#11644) and will (#12708) specify a target mount.
What about trash requests? If one of the goals is to be able to move blocks to cheaper storage, we need to make sure they get deleted from the more expensive storage.
Sorry, that wasn't a deliberate omission. Yes, trash and pull requests both got that in #11644.
(I see they didn't get json field tags though -- fixed)
7931-replicas-by-volume @ d63d7dc79ed74beaaceda15ca88344de12258da3
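For readers unfamiliar with Go's JSON field tags, the following sketch shows why they matter for a trash request that targets a mount. The struct and its JSON key names (`locator`, `block_mtime`, `mount_uuid`) are assumptions chosen for illustration and may not match the names used in the branch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TrashRequest is a hypothetical trash list entry that targets one mount.
// Without the json tags, encoding/json would emit the Go field names
// ("Locator", "BlockMtime", "MountUUID") as keys instead.
type TrashRequest struct {
	Locator    string `json:"locator"`
	BlockMtime int64  `json:"block_mtime"`
	MountUUID  string `json:"mount_uuid"`
}

// encodeTrash serializes a request the way it might be sent to a keepstore.
func encodeTrash(t TrashRequest) (string, error) {
	buf, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	out, err := encodeTrash(TrashRequest{Locator: "abc", BlockMtime: 100, MountUUID: "m1"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```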
Updated by Peter Amstutz over 6 years ago
Tom Clegg wrote:
Yes, that is the current behavior: desired=N currently has the effect "N × whatever the storage back-end provides". So far we have worked around this by setting default collection replication to 1.
Taking into account the replication level reported by keepstore is one of the improvements unblocked by (but not included in) this story.
Ok, I just wanted to confirm my understanding of the code.
(I see they didn't get json field tags though -- fixed)
7931-replicas-by-volume @ d63d7dc79ed74beaaceda15ca88344de12258da3
This LGTM.
Updated by Tom Clegg over 6 years ago
- Status changed from In Progress to Resolved