Feature #15125

[keep-balance] [keepstore] Procedure to halt/reverse/investigate a suspected data loss incident

Added by Tom Clegg 3 months ago. Updated 3 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

A site admin who suspects that keep-balance is erroneously trashing data should be able to:
  • act quickly to minimize the impact, and
  • characterize the damage, if any.
Steps to minimize the impact:
  • immediately prevent keepstore from trashing or deleting any blocks while investigation/recovery proceeds
  • untrash any blocks that might have been trashed erroneously (this may enable affected workflows to resume)
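The untrash step could be scripted roughly as follows. This is only a sketch: it assumes keepstore exposes a per-block untrash endpoint of the form PUT /untrash/<hash> that is authorized with a system token, and the server URLs, the token, and the auth_scheme parameter are placeholders to check against the deployed keepstore version. The function only builds the requests; it does not send them.

```python
import urllib.request


def untrash_requests(keepstore_urls, block_hashes, token, auth_scheme="Bearer"):
    """Build one PUT request per (keepstore server, block hash) pair, without sending."""
    requests = []
    for base in keepstore_urls:
        for block_hash in block_hashes:
            req = urllib.request.Request(
                # Assumed endpoint shape: <keepstore base URL>/untrash/<md5 hash>
                url=f"{base.rstrip('/')}/untrash/{block_hash}",
                method="PUT",
                # Assumed header format; some versions may expect a different scheme.
                headers={"Authorization": f"{auth_scheme} {token}"},
            )
            requests.append(req)
    return requests
```

Each request can then be sent with urllib.request.urlopen(req). An error response from a server that never held the block is expected and should not abort the loop.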
Steps to characterize the damage:
  • get a list of missing block IDs
  • get a list of collections that reference missing blocks (including uuid, pdh, name, project uuid, project name)
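Given a list of missing block IDs, the affected collections can be found by scanning each collection's manifest for matching block locators. The sketch below assumes collection records have already been fetched as dictionaries with a manifest_text field (plus uuid, portable_data_hash, name, and owner_uuid for reporting); the function names are illustrative, and project names would need a separate lookup by owner_uuid.

```python
import re

# Block locator: 32 hex digits (MD5), "+", size in bytes, then optional "+hint" parts.
LOCATOR_RE = re.compile(r"^([0-9a-f]{32})\+(\d+)(\+\S+)*$")


def blocks_in_manifest(manifest_text):
    """Return the set of block hashes referenced by a manifest."""
    hashes = set()
    for line in manifest_text.splitlines():
        # Tokens: stream name, then block locators, then pos:size:filename entries.
        for token in line.split(" ")[1:]:
            match = LOCATOR_RE.match(token)
            if match:
                hashes.add(match.group(1))
    return hashes


def affected_collections(collections, missing_hashes):
    """Return (collection record, sorted missing hashes) for each affected collection."""
    missing = set(missing_hashes)
    result = []
    for coll in collections:
        lost = blocks_in_manifest(coll["manifest_text"]) & missing
        if lost:
            result.append((coll, sorted(lost)))
    return result
```

Matching on the bare hash rather than the full locator means that size and permission hints do not affect the comparison.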
Troubleshooting:
  • report version in metrics (e.g., version{program="keep-balance", version="1.3.1"} = 1)
  • report number and total size of trashed blocks in metrics
  • keepstore "untrash all" management API
  • keep-balance reporting option to get debug info for a list of specific collection IDs and block IDs (without getting the entire debug dump, which is huge)
  • keep-block-check --collection=uuid_or_pdh
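As an illustration of the metrics items above, the following sketch renders a version metric and trashed-block counters in the Prometheus text exposition format (name{labels} value). The metric names trashed_blocks and trashed_bytes are hypothetical placeholders, not existing keep-balance or keepstore metrics.

```python
def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def metric_line(name, value, labels=None):
    """Render one sample as name{label="value",...} value."""
    if labels:
        body = ",".join(
            f'{key}="{escape_label(str(val))}"' for key, val in sorted(labels.items())
        )
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"


def render_metrics(program, version, trashed_count, trashed_bytes):
    """Render the version metric plus number and total size of trashed blocks."""
    lines = [
        metric_line("version", 1, {"program": program, "version": version}),
        # Hypothetical metric names for the number and total size of trashed blocks.
        metric_line("trashed_blocks", trashed_count, {"program": program}),
        metric_line("trashed_bytes", trashed_bytes, {"program": program}),
    ]
    return "\n".join(lines) + "\n"
```

Exposing the version as a label on a constant-valued metric lets an admin correlate a spike in trashed blocks with a specific software release.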

History

#2 Updated by Tom Clegg 3 months ago

  • Description updated

#3 Updated by Tom Morris 3 months ago

  • Target version set to To Be Groomed
