Bug #15148

closed

keep-balance incorrectly accounts for blocks in collections with null `modified_at` field

Added by Tom Morris almost 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 04/26/2019
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

In certain circumstances, when collections have a null modified_at field (which should normally never happen), keep-balance can mark blocks for deletion even though there are still references to them. A check for this will be included in 1.4 and 1.3.2: keep-balance will refuse to run at all if this situation is detected.
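The "refuse to run" safeguard can be pictured with a minimal sketch. This is not keep-balance's actual code (keep-balance is a separate service with its own implementation); the names `CollectionRow`, `NullModifiedAtError`, and `refuse_if_null_modified_at` are hypothetical, and the only assumption is that each collection record exposes a `modified_at` value that may be null.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class CollectionRow:
    """Hypothetical stand-in for one row of the collections table."""
    uuid: str
    modified_at: Optional[datetime]


class NullModifiedAtError(RuntimeError):
    """Raised when balancing must not proceed (hypothetical name)."""


def refuse_if_null_modified_at(rows: Iterable[CollectionRow]) -> None:
    """Abort before any block is marked for deletion if a null modified_at is seen."""
    bad = [row.uuid for row in rows if row.modified_at is None]
    if bad:
        raise NullModifiedAtError(
            f"refusing to run: {len(bad)} collection(s) have null modified_at: "
            + ", ".join(bad)
        )
```

The point of failing the whole run, rather than skipping the affected collections, is that a skipped collection would not contribute its block references, which is exactly how still-referenced blocks end up marked for deletion.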

An API server bug which causes the modified_at field to be null was introduced by #13561 and released with Arvados v1.3. The API server bug was fixed as a side effect of #14595 and this fix will also be included in 1.4 and 1.3.2.

Related: Recovering lost data


Subtasks 1 (0 open, 1 closed)

Task #15156: Review 15148-lost-collection-pdh (Resolved, Peter Amstutz, 04/26/2019)


Related issues

Related to Arvados - Feature #13561: [API] Store, and add APIs to retrieve, previous versions of collection objects (Resolved, Lucas Di Pentima, 10/04/2018)

Actions #1

Updated by Tom Morris almost 4 years ago

Anyone running 1.3 should do these things ASAP:
  • Wherever keep-balance is running: systemctl stop keep-balance (or whatever it takes to disable keep-balance) until a fixed version is released and installed
  • On every keepstore node: add TrashCheckInterval: 87600h (about 10 years) to /etc/arvados/keepstore/keepstore.yml, and then systemctl restart keepstore (or equivalent) to avoid deleting any trashed blocks that are still recoverable
Then:
  • Install a fixed version of keep-balance and arvados-api-server (≥1.3.2 or ≥1.4)
  • Enable keep-balance
  • Use the keepstore untrash API to recover any blocks that were trashed but not yet deleted (details TBD)
  • Delete/revert TrashCheckInterval in keepstore configs and restart keepstore processes

Any system with containers that finished while running Arvados 1.3 will need a migration to fix the collections table for the output collections of those containers.

The following fixes have been made:

  • keep-balance now refuses to run if any modified_at field is null: 15112-dont-trash-needed-replica @ 8b9ea19ebde9f4653d6adc145ef6fcbd36d2aace
  • Database migration to repair any null `modified_at`s:
    • 15112-migration @ 243130b8c5a8558d6bd132d4a062483be93ef7bc
    • 15112-migration-1.3 @ a2c3d1ffd627974c8daa3bff300c0ad96f07d3a0
    • Update migration to handle empty database case @ 7bc8e8add
  • Cherry-pick of the #14595 fix: `git cherry-pick 2aa58f31ac8fc696361214a05ab9ba75a5140b08 4e32f0b140ec0ec7f96c1f9eaae00950c176ff03`
Actions #2

Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)
Actions #4

Updated by Tom Morris almost 4 years ago

  • Release set to 23
Actions #5

Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)
Actions #6

Updated by Tom Clegg almost 4 years ago

  • Install a fixed version of keep-balance and arvados-api-server (≥1.3.2 or ≥1.4)
  • Enable keep-balance
  • Use the keepstore untrash API to recover any blocks that were trashed but not yet deleted (details TBD)

Details: Untrashing lost blocks

Any system with containers that finished while running Arvados 1.3 will need a migration to fix the collections table for the output collections of those containers.

This migration runs during the upgrade to arvados ≥1.3.2 or ≥1.4.

Actions #8

Updated by Peter Amstutz almost 4 years ago

Tom Clegg wrote:

15148-lost-collection-pdh @ 6c5852fb18c0b6422c079c6fee66891a273ad089 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1219/

Waiting on jenkins but this LGTM.

Actions #9

Updated by Tom Clegg almost 4 years ago

This change is in master (48d2a213b, destined for 1.4) and 1.3-dev (675237bec, destined for 1.3.3).

Each line of the "lost blocks" file will now be "BLOCKHASH PDH1 PDH2 ..." where PDH* are all collections that refer to BLOCKHASH. From here you can get a complete list of affected collection PDHs:

cut -d" " -f2- < lost-blocks.txt | tr " " "\n" | sort -u > lost-collections.txt
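For anyone who wants to keep the block-to-collection mapping rather than only the flat list, the same file can be read with a short script. This is a sketch that assumes only the "BLOCKHASH PDH1 PDH2 ..." line format described above; the function names `parse_lost_blocks` and `affected_collections` are made up for illustration.

```python
from typing import Dict, Iterable, List


def parse_lost_blocks(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each BLOCKHASH to the collection PDHs listed after it on its line."""
    refs: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        block, pdhs = fields[0], fields[1:]
        refs.setdefault(block, []).extend(pdhs)
    return refs


def affected_collections(lines: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated PDHs, like the cut | tr | sort -u pipeline."""
    pdhs = set()
    for block_refs in parse_lost_blocks(lines).values():
        pdhs.update(block_refs)
    return sorted(pdhs)


if __name__ == "__main__":
    with open("lost-blocks.txt") as f:
        for pdh in affected_collections(f):
            print(pdh)
```

One difference from the shell pipeline: a line that contains only a block hash and no PDHs contributes nothing here, whereas `cut` passes through lines that contain no delimiter unless `-s` is given.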
Actions #10

Updated by Tom Clegg over 3 years ago

  • Status changed from In Progress to Resolved
Actions #11

Updated by Tom Morris over 3 years ago

  • Description updated (diff)
Actions #12

Updated by Tom Morris over 3 years ago

  • Release changed from 23 to 24
Actions #13

Updated by Tom Morris over 3 years ago

  • Description updated (diff)
Actions #14

Updated by Tom Morris over 3 years ago

  • Related to Feature #13561: [API] Store, and add APIs to retrieve, previous versions of collection objects added