Feature #15125

open

[keep-balance] [keepstore] Procedure to halt/reverse/investigate a suspected data loss incident

Added by Tom Clegg about 5 years ago. Updated 2 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version:
Story points: -
Release:
Release relationship: Auto

Description

A site admin who suspects that keep-balance is erroneously trashing data should be able to:
  • act quickly to minimize the impact, and
  • characterize the damage, if any.
Steps to minimize the impact:
  • immediately prevent keepstore from trashing or deleting any blocks while investigation/recovery proceeds
  • untrash any blocks that might have been trashed erroneously (this may enable affected workflows to resume)
Steps to characterize the damage:
  • get a list of missing block IDs
  • get a list of collections that reference missing blocks (including uuid, pdh, name, project uuid, project name)
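The second characterization step could be sketched roughly as follows. This is only an illustration, not part of any existing tool: it assumes the collection records have already been fetched as dicts with the fields uuid, portable_data_hash, name, owner_uuid and manifest_text, and that the list of missing block IDs (md5+size) is already known. The function names blocks_in_manifest and affected_collections are hypothetical. The project name is not included because it would need a separate lookup of owner_uuid.

```python
import re

# A Keep block locator starts with a 32-hex-digit MD5 and a size.
# Optional hints (for example +A... permission signatures) may follow
# and are ignored here, so blocks compare by md5+size only.
LOCATOR_RE = re.compile(r"^([0-9a-f]{32}\+\d+)(\+\S+)?$")


def blocks_in_manifest(manifest_text):
    """Return the set of md5+size block IDs referenced by a manifest."""
    blocks = set()
    for line in manifest_text.splitlines():
        # Token 0 is the stream name; file tokens (pos:size:name)
        # do not match the locator pattern and are skipped.
        for token in line.split()[1:]:
            match = LOCATOR_RE.match(token)
            if match:
                blocks.add(match.group(1))
    return blocks


def affected_collections(collections, missing_blocks):
    """List collections that reference at least one missing block.

    `collections` is an iterable of dicts (assumed shape, see above);
    `missing_blocks` is an iterable of md5+size strings.
    """
    missing = set(missing_blocks)
    report = []
    for coll in collections:
        lost = blocks_in_manifest(coll["manifest_text"]) & missing
        if lost:
            report.append({
                "uuid": coll["uuid"],
                "pdh": coll["portable_data_hash"],
                "name": coll["name"],
                "project_uuid": coll["owner_uuid"],  # assumed to be the project
                "missing_blocks": sorted(lost),
            })
    return report
```

Comparing on md5+size rather than the full locator means signed and unsigned forms of the same block are treated as equal, which seems appropriate when the missing list comes from a different source than the manifests.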
Troubleshooting:
  • report version in metrics (e.g., version{program="keep-balance", version="1.3.1"} = 1)
  • report the number and total size of trashed blocks in metrics
  • keepstore "untrash all" management API
  • keep-balance reporting option to get debug info for a list of specific collection IDs and block IDs (without getting the entire debug dump, which is huge)
  • keep-block-check --collection=uuid_or_pdh
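To make the two metrics items concrete, here is a minimal sketch of how such samples might look in the Prometheus text exposition format. The helper format_sample and the metric names trashed_blocks and trashed_bytes are hypothetical, chosen only for illustration; the ticket does not specify them.

```python
def format_sample(name, labels, value):
    """Render one metric sample as a Prometheus text-format line."""
    def esc(label_value):
        # Label values escape backslash, double quote and newline.
        return (label_value.replace("\\", "\\\\")
                           .replace('"', '\\"')
                           .replace("\n", "\\n"))

    if not labels:
        return f"{name} {value}"
    label_str = ",".join(f'{k}="{esc(v)}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"


def trash_metrics(program, version, trashed_sizes):
    """Build the version sample plus count and total size of trashed blocks.

    `trashed_sizes` is an iterable of block sizes in bytes (assumed input).
    """
    sizes = list(trashed_sizes)
    return [
        format_sample("version", {"program": program, "version": version}, 1),
        format_sample("trashed_blocks", {}, len(sizes)),  # hypothetical name
        format_sample("trashed_bytes", {}, sum(sizes)),   # hypothetical name
    ]
```

For example, trash_metrics("keep-balance", "1.3.1", [3, 5]) would yield a version line with value 1, followed by a block count of 2 and a byte total of 8.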

Related issues

Related to Arvados Epics - Idea #16514: Actionable insight into keep usage (New)