Feature #15125

[keep-balance] [keepstore] Procedure to halt/reverse/investigate a suspected data loss incident

Added by Tom Clegg over 2 years ago. Updated 3 months ago.


A site admin, upon suspecting keep-balance is erroneously trashing some data, should be able to:
  • act quickly to minimize the impact, and
  • characterize the damage, if any.
Steps to minimize the impact:
  • immediately prevent keepstore from trashing or deleting any blocks while investigation/recovery proceeds
  • untrash any blocks that might have been trashed erroneously (this may enable affected workflows to resume)
Steps to characterize the damage:
  • get a list of missing block IDs
  • get a list of collections that reference missing blocks (including uuid, pdh, name, project uuid, project name)
  • report version in metrics (e.g., version{program="keep-balance", version="1.3.1"} = 1)
  • report the number and total size of trashed blocks in metrics
  • keepstore "untrash all" management API
  • keep-balance reporting option to get debug info for a list of specific collection IDs and block IDs (without getting the entire debug dump, which is huge)
  • keep-block-check --collection=uuid_or_pdh
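
The "collections that reference missing blocks" step above could be sketched roughly as follows. This is an illustration only, not an existing Arvados tool: the function names block_hashes and affected_collections are hypothetical, and the input is assumed to be collection records (dicts with uuid, portable_data_hash, name, owner_uuid and manifest_text) that have already been fetched by some other means. The sketch relies only on the manifest text format, in which each stream line is a stream name followed by block locators (a 32-hex-digit MD5, "+size", and optional hints) and then file tokens.

```python
import re

# A block locator starts with a 32-hex-digit MD5 hash, then "+<size>",
# then zero or more "+hint" suffixes (e.g. a signature hint).
LOCATOR_RE = re.compile(r"^([0-9a-f]{32})\+\d+(\+\S+)*$")


def block_hashes(manifest_text):
    """Return the set of block MD5 hashes referenced by a manifest."""
    hashes = set()
    for line in manifest_text.splitlines():
        # Token 0 is the stream name; file tokens ("pos:size:name")
        # do not match LOCATOR_RE and are skipped.
        for token in line.split(" ")[1:]:
            match = LOCATOR_RE.match(token)
            if match:
                hashes.add(match.group(1))
    return hashes


def affected_collections(collections, missing_blocks):
    """List collections that reference at least one missing block.

    collections: iterable of dicts (hypothetical input shape) with keys
        uuid, portable_data_hash, name, owner_uuid, manifest_text.
    missing_blocks: iterable of block hashes or full locators.
    """
    # Compare on the bare hash so "hash" and "hash+size+hints" both work.
    missing = {loc.split("+", 1)[0] for loc in missing_blocks}
    report = []
    for coll in collections:
        lost = block_hashes(coll["manifest_text"]) & missing
        if lost:
            report.append({
                "uuid": coll["uuid"],
                "pdh": coll["portable_data_hash"],
                "name": coll["name"],
                "project_uuid": coll["owner_uuid"],
                "missing_blocks": sorted(lost),
            })
    return report
```

The project name requested above is not part of a collection record in this sketch; it would need a separate lookup keyed on project_uuid.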

Related issues

Related to Arvados Epics - Story #16514: Actionable insight into keep usage | New | 01/01/2022 | 03/31/2022


#2 Updated by Tom Clegg over 2 years ago

  • Description updated

#3 Updated by Tom Morris over 2 years ago

  • Target version set to To Be Groomed

#4 Updated by Ward Vandewege over 1 year ago

  • Related to Story #16514: Actionable insight into keep usage added

#5 Updated by Peter Amstutz 3 months ago

  • Target version deleted (To Be Groomed)
