[Data Manager] Behave appropriately when multiple keepstore nodes share a single storage volume
Currently, Data Manager assumes that if two keepstore nodes report the same block B in their indexes, there are two copies of block B stored on two distinct volumes on two distinct nodes. This is not true in the recommended blob storage configuration: multiple keepstore services use a single blob storage volume.
For example, in a blob storage configuration with 8 keepstore nodes sharing a single volume, Data Manager will see 8 copies and consider all blocks to be overreplicated.
This confusion will also arise if two keepstore servers access the same backing store, even when the paths they use are distinct (e.g., different local mount points attached to the same NFS mount).
A related problem is that volumes could be moved between servers after Data Manager decides what to delete but before it acts on that decision. This could result in too many replicas being deleted.
Resolution
Data manager must compare the stored "last PUT" timestamps for each block in the trash list. It must ensure the timestamp on an excess copy does not match the timestamp on any of the still-needed copies. If such a collision occurs, the excess copy cannot be deleted safely.
- Data manager should perform some operation that will refresh the timestamp of the still-needed copies. This will allow the excess block to be deleted in a subsequent run.
- Data manager must not delete the excess copy.
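The timestamp check described above could be sketched roughly as follows. This is only an illustration: the `Replica` type, its fields, and the `SafeToTrash` function are hypothetical names, not Data Manager's actual code, and the sketch assumes the "last PUT" timestamp is available as an integer for each index entry.

```go
package main

import "fmt"

// Replica is a hypothetical record of one index entry: the keepstore
// service that reported the block and the "last PUT" timestamp it gave.
type Replica struct {
	ServiceUUID string
	Mtime       int64
}

// SafeToTrash splits the excess replicas into those whose timestamp
// differs from every kept replica (safe) and those whose timestamp
// collides with a kept replica (deferred). A collision may mean the
// "excess" entry is the same stored copy seen through another server.
func SafeToTrash(keep, excess []Replica) (safe, deferred []Replica) {
	keptTimes := make(map[int64]bool, len(keep))
	for _, r := range keep {
		keptTimes[r.Mtime] = true
	}
	for _, r := range excess {
		if keptTimes[r.Mtime] {
			deferred = append(deferred, r)
		} else {
			safe = append(safe, r)
		}
	}
	return safe, deferred
}

func main() {
	keep := []Replica{{ServiceUUID: "svc-a", Mtime: 1000}}
	excess := []Replica{
		{ServiceUUID: "svc-b", Mtime: 1000}, // same timestamp: defer
		{ServiceUUID: "svc-c", Mtime: 2000}, // different timestamp: safe
	}
	safe, deferred := SafeToTrash(keep, excess)
	fmt.Println(len(safe), len(deferred)) // prints "1 1"
}
```

Deferred replicas would be left alone in this run; after the still-needed copies have had their timestamps refreshed, a later run would see distinct timestamps if the copies really are distinct.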
Related issues
Not addressed in this story:
- If a client writes 1 copy to each of 2 keepstore services that use the same backing store, the client will erroneously conclude that it has achieved replication=2.
Updated by Tom Clegg over 6 years ago
- an ID for each volume that can be used to determine whether two keepstore servers share a backing store (e.g., "UnixVolume:hostname:dir", "UnixVolume:disk-uuid:dir", "AzureBlobVolume:accountName:containerName")
- a distinct index for each volume (e.g., /volumes/volumeID/index)
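If each index entry carried a volume ID like the ones suggested above, counting copies per distinct backing store might look something like this sketch. The `IndexEntry` type and `DistinctVolumes` function are hypothetical, and the block hash and volume ID strings are placeholders. Note that this counts distinct backing stores only; it says nothing about how many replicas a given backing store keeps internally.

```go
package main

import "fmt"

// IndexEntry is a hypothetical index row that includes the volume ID
// of the backing store, in addition to the reporting service.
type IndexEntry struct {
	ServiceUUID string
	VolumeID    string // e.g. "AzureBlobVolume:accountName:containerName"
	BlockHash   string
}

// DistinctVolumes returns, for each block hash, the number of distinct
// volume IDs that hold it, regardless of how many services report it.
func DistinctVolumes(entries []IndexEntry) map[string]int {
	seen := make(map[string]map[string]bool)
	for _, e := range entries {
		if seen[e.BlockHash] == nil {
			seen[e.BlockHash] = make(map[string]bool)
		}
		seen[e.BlockHash][e.VolumeID] = true
	}
	counts := make(map[string]int, len(seen))
	for hash, vols := range seen {
		counts[hash] = len(vols)
	}
	return counts
}

func main() {
	// Eight keepstore services all reporting block B from one shared volume.
	var entries []IndexEntry
	for i := 0; i < 8; i++ {
		entries = append(entries, IndexEntry{
			ServiceUUID: fmt.Sprintf("keep%d", i),
			VolumeID:    "AzureBlobVolume:accountName:containerName",
			BlockHash:   "blockB",
		})
	}
	fmt.Println(DistinctVolumes(entries)["blockB"]) // prints "1"
}
```

With this kind of grouping, the 8-node shared-volume example from the description would be seen as one stored copy rather than eight.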
Another option, less general but quicker to implement: A "don't index / don't manage" flag on each keep_service record would enable a common special case (disjoint sets A, B, C of backing stores where some keepstore servers use A, others use B, others use C). This could have other uses too, like performing orderly removal of an excess/EOL node by doing most of the restoring/rebalancing before the node becomes unavailable to clients.
Updated by Brett Smith over 6 years ago
FWIW, we had a lunch discussion about this from the perspective of clients trying to achieve particular replication levels when they write blocks. For example, assume a cluster with two Keepstores backed by the same blob storage volume that provides 3x replication. If a client wants to write a block with 4x replication, today I believe it will PUT that block to both Keepstores and succeed because it believes it achieved 6x replication (3x replication reported by both Keepstores), when in reality the block is only replicated 3x in the shared storage volume.
A solution that let us fix this too (e.g., tying replication reports to volume IDs) would be nice.
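As a rough illustration of tying replication reports to volume IDs, the sketch below compares a naive client-side sum with a per-volume sum. Everything here is an assumption for the sake of the example: `PutResult`, its fields, and both functions are hypothetical, and no current keepstore response is claimed to include a volume ID.

```go
package main

import "fmt"

// PutResult is a hypothetical PUT response: the backing volume that
// stored the block and the replication level reported for that volume.
type PutResult struct {
	VolumeID string
	Replicas int
}

// NaiveReplication adds up every reported replication level, which
// double-counts when several servers share one backing volume.
func NaiveReplication(results []PutResult) int {
	total := 0
	for _, r := range results {
		total += r.Replicas
	}
	return total
}

// ReplicationByVolume counts each volume ID once, using the highest
// replication level reported for that volume.
func ReplicationByVolume(results []PutResult) int {
	perVolume := make(map[string]int)
	for _, r := range results {
		if r.Replicas > perVolume[r.VolumeID] {
			perVolume[r.VolumeID] = r.Replicas
		}
	}
	total := 0
	for _, n := range perVolume {
		total += n
	}
	return total
}

func main() {
	// Two keepstores backed by the same 3x-replicated blob volume.
	results := []PutResult{
		{VolumeID: "shared-blob-volume", Replicas: 3},
		{VolumeID: "shared-blob-volume", Replicas: 3},
	}
	fmt.Println(NaiveReplication(results), ReplicationByVolume(results)) // prints "6 3"
}
```

Under these assumptions, a client asking for 4x replication would see 3x from the per-volume figure and know it had not yet met its target.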
Updated by Tom Clegg over 6 years ago
Another interim option is for data manager to accept a list of service UUIDs to manage. This would be more work for the sysadmin to maintain, but wouldn't require any server-side changes at all.
If, by the time we need data manager correctness, we have already started tracking and exposing volume IDs in order to deal with the problem of initial replication level attained by clients, then "different index per volume ID" will probably be that much easier to implement. Either way, it's certainly a more complete solution than a "don't manage" flag.