Bug #7853

[Data Manager] Behave appropriately when multiple keepstore nodes share a single storage volume

Added by Tom Clegg about 9 years ago. Updated almost 8 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: Keep
Target version: -
Story points: 1.0

Description

Background

Currently, Data Manager assumes that if two keepstore nodes report the same block B in their indexes, there are two copies of block B stored on two distinct volumes on two distinct nodes. This is not true in the recommended blob storage configuration: multiple keepstore services use a single blob storage volume.

For example, in a blob storage configuration with 8 keepstore nodes sharing a single volume, Data Manager will see 8 copies of each block and consider every block overreplicated, even though each block is stored only once on the shared volume.
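
To make the miscount concrete, here is a minimal sketch in Go (made-up service names, not Data Manager's actual code) of the current assumption: one replica is counted for every keepstore index that reports the block.

  // replica_count_sketch.go — hypothetical illustration only.
  package main

  import "fmt"

  func main() {
  	// Eight keepstore services share one blob storage volume, so each
  	// service's index reports the same block B.
  	servicesReportingB := []string{
  		"keep0", "keep1", "keep2", "keep3",
  		"keep4", "keep5", "keep6", "keep7",
  	}

  	// Data Manager's current assumption: one distinct copy per reporting service.
  	assumedReplicas := len(servicesReportingB)

  	// Reality in the shared-volume configuration: one stored copy.
  	actualReplicas := 1

  	fmt.Printf("assumed=%d actual=%d\n", assumedReplicas, actualReplicas)
  	// With a desired replication of 2, Data Manager sees 8 >= 2 and treats
  	// block B as overreplicated, even though only one copy exists.
  }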

This confusion also arises whenever two keepstore servers access the same backing store, even if their configured paths look distinct (e.g., different local mount points attached to the same NFS export).

A related problem is that volumes could get moved around after Data Manager decides what to delete, but before it takes action. This could result in too many replicas being deleted.

Resolution

Data Manager must compare the stored "last PUT" timestamps for each block on the trash list, and make sure that the timestamp on an excess copy does not match the timestamp on any of the still-needed copies. If such a collision occurs, the excess copy cannot be deleted safely (a sketch of this check follows the list below).
  • Data Manager should perform some operation that refreshes the timestamps of the still-needed copies, so that the excess copy can be deleted in a subsequent run.
  • Data Manager must not delete the excess copy.
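
A minimal sketch of the timestamp check described above, assuming a hypothetical Copy type; the real trash-list handling in Data Manager and keepstore will differ in detail.

  package sketch

  import "time"

  // Copy describes one replica of a block as reported by a keepstore index.
  // Hypothetical type for illustration; not Data Manager's real data structure.
  type Copy struct {
  	Service string
  	LastPut time.Time // stored "last PUT" timestamp for the block
  }

  // canTrash reports whether the excess copy can be deleted safely: its
  // last-PUT timestamp must differ from every still-needed copy's timestamp.
  // A matching timestamp suggests the two "copies" may be the same object on
  // a shared backing store, so the caller should refresh (touch) the needed
  // copies instead, and retry the deletion on a later run.
  func canTrash(excess Copy, needed []Copy) bool {
  	for _, c := range needed {
  		if excess.LastPut.Equal(c.LastPut) {
  			return false
  		}
  	}
  	return true
  }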

Related issues

Not addressed in this story:
  • If a client writes 1 copy to each of 2 keepstore services that use the same backing store, the client will erroneously conclude that it has achieved replication=2.
#1

Updated by Tom Clegg about 9 years ago

  • Description updated (diff)
#2

Updated by Tom Clegg about 9 years ago

To address the general case, where a keepstore node has multiple volumes that are shared with different sets of other keepstore nodes, Data Manager needs:
  • an ID for each volume that can be used to determine whether two keepstore servers share a backing store (e.g., "UnixVolume:hostname:dir", "UnixVolume:disk-uuid:dir", "AzureBlobVolume:accountName:containerName")
  • a distinct index for each volume (e.g., /volumes/volumeID/index); see the sketch below
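
A minimal sketch of how such volume IDs could be used, assuming a hypothetical IndexEntry type (the /volumes/volumeID/index endpoint proposed above does not exist yet): replicas are counted per distinct backing store rather than per service.

  package sketch

  // IndexEntry is a hypothetical per-volume index record: which keepstore
  // service reported the block, and the ID of the backing volume it came from
  // (e.g. "AzureBlobVolume:accountName:containerName").
  type IndexEntry struct {
  	Service  string
  	VolumeID string
  }

  // distinctReplicas counts replicas of a block by distinct backing store, so
  // eight services sharing one container contribute one replica, not eight.
  func distinctReplicas(entries []IndexEntry) int {
  	seen := map[string]bool{}
  	for _, e := range entries {
  		seen[e.VolumeID] = true
  	}
  	return len(seen)
  }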

Another option, less general but quicker to implement: a "don't index / don't manage" flag on each keep_service record would cover a common special case (disjoint sets A, B, C of backing stores, where some keepstore servers use A, others use B, and others use C). This could have other uses too, such as performing an orderly removal of an excess/EOL node by doing most of the restoring/rebalancing before the node becomes unavailable to clients.
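
A minimal sketch of that special case, assuming a hypothetical Unmanaged field (keep_service records have no such flag today): Data Manager would simply skip flagged services when indexing and rebalancing.

  package sketch

  // KeepService mirrors the relevant parts of a keep_service record; the
  // Unmanaged field is hypothetical ("don't index / don't manage").
  type KeepService struct {
  	UUID      string
  	Unmanaged bool
  }

  // managedServices returns only the services Data Manager should index and
  // rebalance, skipping flagged nodes (e.g. one being drained for removal).
  func managedServices(all []KeepService) []KeepService {
  	var out []KeepService
  	for _, s := range all {
  		if !s.Unmanaged {
  			out = append(out, s)
  		}
  	}
  	return out
  }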

#3

Updated by Brett Smith about 9 years ago

FWIW, we had a lunch discussion about this from the perspective of clients trying to achieve particular replication levels when they write blocks. For example, assume a cluster with two Keepstores backed by the same blob storage volume, which provides 3x replication. If a client wants to write a block with 4x replication, today I believe it will PUT that block to both Keepstores and succeed, because it believes it achieved 6x replication (3x reported by each Keepstore), when in reality the block is only replicated 3x in the shared storage volume.

A solution that let us fix this too (e.g., tying replication reports to volume IDs) would be nice.
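
A minimal sketch of the accounting gap, assuming a hypothetical PutResponse shape that ties each keepstore's replication report to a volume ID: summing per-service reports yields 6x for the example above, while counting each distinct volume once yields the true 3x.

  package sketch

  // PutResponse is a hypothetical summary of one keepstore's reply to a PUT:
  // how many copies it claims to have written and on which backing volume.
  type PutResponse struct {
  	VolumeID    string
  	Replication int
  }

  // naiveReplication is what clients effectively compute today: with two
  // services each reporting 3x on the same shared volume, it returns 6.
  func naiveReplication(resps []PutResponse) int {
  	total := 0
  	for _, r := range resps {
  		total += r.Replication
  	}
  	return total
  }

  // volumeAwareReplication counts each backing volume once (taking the
  // highest report per volume), so the same two responses yield 3.
  func volumeAwareReplication(resps []PutResponse) int {
  	perVolume := map[string]int{}
  	for _, r := range resps {
  		if r.Replication > perVolume[r.VolumeID] {
  			perVolume[r.VolumeID] = r.Replication
  		}
  	}
  	total := 0
  	for _, n := range perVolume {
  		total += n
  	}
  	return total
  }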

#4

Updated by Tom Clegg about 9 years ago

Another interim option is for Data Manager to accept a list of service UUIDs to manage. This would be more work for the sysadmin to maintain, but it wouldn't require any server-side changes at all.

If, by the time we need Data Manager correctness, we have already started tracking and exposing volume IDs in order to address the client-side replication accounting problem, then "different index per volume ID" will probably be that much easier to implement. Either way, it's certainly a more complete solution than a "don't manage" flag.

#5

Updated by Tom Clegg about 9 years ago

  • Description updated (diff)
  • Category set to Keep
  • Story points set to 1.0
#6

Updated by Brett Smith about 9 years ago

  • Target version set to Arvados Future Sprints
#7

Updated by Tom Clegg almost 8 years ago

  • Status changed from New to Closed
#8

Updated by Tom Clegg almost 8 years ago

  • Target version deleted (Arvados Future Sprints)