Story #7931

[keep-balance] Count block replication by volume IDs

Added by Brett Smith over 4 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Keep
Target version:
Start date: 12/02/2015
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

Use the volume information added in #7928. Count the number of times a block is replicated for each volume it's stored on, rather than each Keepstore that reports it in the index.
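A minimal sketch of the counting change described above (not the actual keep-balance code; the data shapes and names here are hypothetical): two index entries that report the same device ID refer to the same underlying replica, so they are counted once, whereas the old behavior counted one replica per reporting keepstore.

```python
# Illustrative sketch: count block replication by distinct volume/device ID
# rather than by the number of keepstore servers reporting the block.

def count_replicas(index_entries):
    """index_entries: iterable of (server_uuid, device_id, block_hash) tuples.

    Returns {block_hash: replica_count}, where entries sharing a device_id
    are treated as one physical replica.
    """
    devices_per_block = {}
    for server, device, block in index_entries:
        devices_per_block.setdefault(block, set()).add(device)
    return {block: len(devices) for block, devices in devices_per_block.items()}

entries = [
    ("keep0", "dev-a", "acbd18db"),  # primary writable mount on keep0
    ("keep1", "dev-a", "acbd18db"),  # same volume mounted read-only on keep1
    ("keep1", "dev-b", "acbd18db"),  # a genuinely distinct replica
]
print(count_replicas(entries))  # dev-a counted once -> {'acbd18db': 2}
```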


Subtasks

Task #13155: Review 7931-replicas-by-volume (Resolved, Peter Amstutz)


Related issues

Related to Arvados - Feature #11184: [Keep] Support multiple storage classes (In Progress)

Blocked by Arvados - Story #7928: [Keep] keepstore identifies underlying volumes to clients (Duplicate, 12/02/2015)

Associated revisions

Revision d85b7e29
Added by Tom Clegg over 2 years ago

Merge branch '7931-replicas-by-volume'

refs #7931

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg almost 3 years ago

  • Subject changed from [Data Manager] Count block replication by volume IDs to [keep-balance] Count block replication by volume IDs

#2 Updated by Tom Clegg over 2 years ago

  • Related to Feature #11184: [Keep] Support multiple storage classes added

#3 Updated by Tom Morris over 2 years ago

  • Target version set to To Be Groomed

#4 Updated by Tom Morris over 2 years ago

  • Assigned To set to Tom Clegg
  • Target version changed from To Be Groomed to 2018-03-14 Sprint

#5 Updated by Tom Clegg over 2 years ago

  • Target version changed from 2018-03-14 Sprint to 2018-03-28 Sprint

#6 Updated by Tom Clegg over 2 years ago

  • Status changed from New to In Progress

#7 Updated by Tom Clegg over 2 years ago

#8 Updated by Peter Amstutz over 2 years ago

I understand we currently run Keep in a high-availability configuration, where each service has a primary writable mount and a secondary read-only mount which is the primary mount of some other keepstore. So I'm assuming we want keep-balance to do the right thing in this configuration.

Generally, this doesn't go all the way to identifying replicas by mount+pdh, and still relies on pdh+mtime logic.

  • Instead of using "same mtime" to decide if two replicas are actually the same, it should be using the mount UUID.
  • When there are multiple replicas on the same service, replicas on read-only mounts should take precedence over writable mounts (since only the writable replicas can be trashed).
  • When deciding whether to trash or keep a given replica, it checks read-only on a per-service basis, but not on a per-mount basis.
  • Trash and pull requests don't (can't?) target a particular mount (trashing targets the block by mtime). This seems like it would be useful for moving blocks between mounts of different storage classes.
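The second bullet's precedence rule can be sketched as follows (a hedged illustration with hypothetical names, not the Arvados implementation): when one service reports a block on both a read-only and a writable mount, retain the read-only copy preferentially, because only replicas on writable mounts can actually be trashed.

```python
# Illustrative sketch: choose which replicas to keep vs. trash on one
# service, preferring read-only mounts for the kept copies.

def classify_replicas(replicas, want):
    """replicas: list of dicts with 'mount' and 'read_only' keys, all for one
    block on one service. want: number of replicas to retain here.
    Returns (keep, trash) lists of mount IDs."""
    # Read-only replicas sort first, so they are retained preferentially.
    ordered = sorted(replicas, key=lambda r: not r["read_only"])
    keep = [r["mount"] for r in ordered[:want]]
    # Only writable replicas can be trashed at all.
    trash = [r["mount"] for r in ordered[want:] if not r["read_only"]]
    return keep, trash

keep, trash = classify_replicas(
    [{"mount": "mnt-rw", "read_only": False},
     {"mount": "mnt-ro", "read_only": True}],
    want=1)
print(keep, trash)  # ['mnt-ro'] ['mnt-rw']
```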

#9 Updated by Peter Amstutz over 2 years ago

Also, it doesn't seem to take the mount's underlying Replication count into account. So if a block is located in an object storage account with Replication=3, and the block has requested replication=2, it would still be stored in two storage accounts?

#10 Updated by Tom Clegg over 2 years ago

The context here is #11184 (multiple storage classes). The objective is to move keep-balance to keepstore's mount-oriented index APIs (#11644) as a prerequisite for making decisions based on the mount information (#12708).

Trash and pull requests don't (can't?) target a particular mount (trashing targets the block by mtime). This seems like it would be useful for moving blocks between mounts of different storage classes

Pull requests can (#11644) and will (#12708) specify a target mount.

There is one improvement included here that isn't strictly necessary: 61f4a861f supports removing redundant replicas when a single server (one of the best in rendezvous order) has replicas on more than one volume. But this issue/branch probably shouldn't turn into "do things that become possible with mount-oriented APIs".

#11 Updated by Tom Clegg over 2 years ago

Peter Amstutz wrote:

  • Instead of using "same mtime" to decide if two replicas are actually the same, it should be using the mount UUID.

To be precise, mount UUIDs are always distinct. "Mtime is equal, but DeviceID is different" should make replicas safe to delete, now that we know DeviceID -- but there are many what-ifs to consider and tests to write to make that kind of change safely, and it seems tangential to the issue at hand, so I'd rather not creep scope.

  • When there are multiple replicas on the same service, replicas on read-only mounts should take precedence over writable mounts (since only the writable replicas can be trashed).
  • When deciding whether to trash or keep a given replica, it checks read-only on a per-service basis, but not on a per-mount basis.

Addressed in 7931-replicas-by-volume @ 41e612b59ad85ee7f22ebf3239ec8ff1cbb463c5

#12 Updated by Peter Amstutz over 2 years ago

Tom Clegg wrote:

Peter Amstutz wrote:

  • Instead of using "same mtime" to decide if two replicas are actually the same, it should be using the mount UUID.

To be precise, mount UUIDs are always distinct. "Mtime is equal, but DeviceID is different" should make replicas safe to delete, now that we know DeviceID -- but there are many what-ifs to consider and tests to write to make that kind of change safely, and it seems tangential to the issue at hand, so I'd rather not creep scope.

Ok, I agree we should continue to use mtime as a predictor that two replicas are probably the same. Going the other way, is it possible that the same actual replica reports two different mtimes, because it is touched while getting the index of two servers that use the same backend? In that case, it would be safer to trash only if both the mtime and the DeviceID are different.

  • When there are multiple replicas on the same service, replicas on read-only mounts should take precedence over writable mounts (since only the writable replicas can be trashed).
  • When deciding whether to trash or keep a given replica, it checks read-only on a per-service basis, but not on a per-mount basis.

Addressed in 7931-replicas-by-volume @ 41e612b59ad85ee7f22ebf3239ec8ff1cbb463c5

Waiting for response on https://dev.arvados.org/issues/7931#note-9

Pull requests can (#11644) and will (#12708) specify a target mount.

What about trash requests? If one of the goals is to be able to move blocks to cheaper storage, we need to make sure they get deleted from the more expensive storage.

Clearly more work is going to happen for per-mount balancing in #12708.

#13 Updated by Tom Clegg over 2 years ago

Peter Amstutz wrote:

Ok, I agree we should continue to use mtime as a predictor that two replicas are probably the same. Going the other way, is it possible that the same actual replica reports two different mtimes, because it is touched while getting the index of two servers that use the same backend? In that case, it would be safer to trash only if both the mtime and the DeviceID are different.

This race is a problem if someone changes mtime to a different old timestamp. But when keepstore changes mtime it always changes it to now, and keep-balance doesn't delete replicas with recent mtime. If keep-balance chooses to delete the old copy, the timestamp won't match when keepstore processes its trash list, so nothing will happen.
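The safety property described in this paragraph can be illustrated with a small sketch (hedged: names and the TTL value are assumptions, not Arvados code): a trash request takes effect only if the replica's current mtime still equals the mtime recorded in the request and is older than the signature TTL, so a replica touched since the index was taken (mtime reset to "now") survives.

```python
import time

# Illustrative sketch of keepstore's trash safety check as described above.

SIGNATURE_TTL = 14 * 24 * 3600  # assumed TTL in seconds (hypothetical value)

def should_trash(request_mtime, current_mtime, now):
    # Trash only if the timestamp still matches the trash request and the
    # replica is older than the signature TTL.
    return (current_mtime == request_mtime and
            now - current_mtime > SIGNATURE_TTL)

now = time.time()
old = now - 30 * 24 * 3600
print(should_trash(old, old, now))  # True: untouched old replica
print(should_trash(old, now, now))  # False: touched since the index was taken
```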

Waiting for response on https://dev.arvados.org/issues/7931#note-9

Yes, that is the current behavior: desired=N currently has the effect "N × whatever the storage back-end provides". So far we have worked around this by setting default collection replication to 1.

Taking into account the replication level reported by keepstore is one of the improvements unblocked by (but not included in) this story.
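The arithmetic behind the current behavior is simple enough to state as a one-line sketch (illustrative only, not Arvados code): while per-mount replication is ignored, desired=N places the block on N mounts, and each mount's backend multiplies that by its own replication factor.

```python
# Illustrative arithmetic for "desired=N means N x whatever the storage
# back-end provides" (hypothetical helper, not keep-balance code).

def physical_copies(desired, backend_replication):
    # keep-balance places the block on `desired` mounts; each backend then
    # stores `backend_replication` physical copies of its replica.
    return desired * backend_replication

print(physical_copies(2, 3))  # desired=2 on Replication=3 storage -> 6 copies
```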

Pull requests can (#11644) and will (#12708) specify a target mount.

What about trash requests? If one of the goals is to be able to move blocks to cheaper storage, we need to make sure they get deleted from the more expensive storage.

Sorry, that wasn't a deliberate omission. Yes, trash and pull requests both got that in #11644.

(I see they didn't get json field tags though -- fixed)

7931-replicas-by-volume @ d63d7dc79ed74beaaceda15ca88344de12258da3

#14 Updated by Peter Amstutz over 2 years ago

Tom Clegg wrote:

Yes, that is the current behavior: desired=N currently has the effect "N × whatever the storage back-end provides". So far we have worked around this by setting default collection replication to 1.

Taking into account the replication level reported by keepstore is one of the improvements unblocked by (but not included in) this story.

Ok, I just wanted to confirm my understanding of the code.

(I see they didn't get json field tags though -- fixed)

7931-replicas-by-volume @ d63d7dc79ed74beaaceda15ca88344de12258da3

This LGTM.

#15 Updated by Tom Clegg over 2 years ago

  • Status changed from In Progress to Resolved

#16 Updated by Tom Morris about 2 years ago

  • Release set to 17
