Bug #8878

Keep: sudden appearance of "missing" blocks

Added by Peter Grandi about 8 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

I ran a "garbage collection" before Easter, with the following output:

2016/03/24 17:06:10 Read and processed 417 collections
2016/03/24 17:06:13 Blocks In Collections: 514668, 
Blocks In Keep: 961866.
2016/03/24 17:06:13 Replication Block Counts:
 Missing From Keep: 0, 
 Under Replicated: 0, 
 Over Replicated: 1650, 
 Replicated Just Right: 513018, 
 Not In Any Collection: 447198. 
Replication Collection Counts:
 Missing From Keep: 0, 
 Under Replicated: 0, 
 Over Replicated: 11, 
 Replicated Just Right: 406.
2016/03/24 17:06:13 Blocks Histogram:
2016/03/24 17:06:13 {Requested:0 Actual:1}:     444455
2016/03/24 17:06:13 {Requested:0 Actual:2}:       2743
2016/03/24 17:06:13 {Requested:1 Actual:1}:     513018
2016/03/24 17:06:13 {Requested:1 Actual:2}:       1647
2016/03/24 17:06:13 {Requested:1 Actual:3}:          3
2016/03/24 17:06:15 Sending trash list to http://keep9.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep3.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep6.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep5.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep0.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep4.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep7.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep8.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep1.gcam1.example.com:25107
2016/03/24 17:06:15 Sent trash list to http://keep1.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:15 Sent trash list to http://keep0.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:16 Sent trash list to http://keep4.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:16 Sent trash list to http://keep9.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:16 Sent trash list to http://keep3.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:16 Sent trash list to http://keep5.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:16 Sent trash list to http://keep8.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:16 Sent trash list to http://keep7.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:16 Sent trash list to http://keep6.gcam1.example.com:25107: response was HTTP 200 OK

Then, over the past week, we uploaded two 4GB collections and deleted the two 4GB collections that they were meant to replace. I then ran the Data Manager again in dry-run mode, and the outcome is:

2016/04/04 12:51:17 Read and processed 421 collections
2016/04/04 12:51:19 Blocks In Collections: 782548,
Blocks In Keep: 716788.
2016/04/04 12:51:19 Replication Block Counts:
 Missing From Keep: 65760,
 Under Replicated: 0,
 Over Replicated: 41180,
 Replicated Just Right: 675608,
 Not In Any Collection: 0.
Replication Collection Counts:
 Missing From Keep: 3,
 Under Replicated: 0,
 Over Replicated: 13,
 Replicated Just Right: 405.
2016/04/04 12:51:19 Blocks Histogram:
2016/04/04 12:51:19 {Requested:1 Actual:0}:      65760
2016/04/04 12:51:19 {Requested:1 Actual:1}:     675608
2016/04/04 12:51:19 {Requested:1 Actual:2}:      41177
2016/04/04 12:51:19 {Requested:1 Actual:3}:          3

It is disconcerting to see {Requested:1 Actual:0}: 65760 (around 4GiB) but also {Requested:1 Actual:2}: 41177 (around 2.5GiB).
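As a sanity check on the two reports, the summary counts can be cross-checked arithmetically. The sketch below assumes (this is an interpretation of the output, not documented Data Manager behaviour) that the four replication categories partition "Blocks In Collections", that every category except "Missing From Keep" plus "Not In Any Collection" adds up to "Blocks In Keep", and that the collection categories add up to the number of collections processed. The `RunSummary` type and `inconsistencies` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Counts copied by hand from one Data Manager report."""
    collections: int
    blocks_in_collections: int
    blocks_in_keep: int
    blk_missing: int
    blk_under: int
    blk_over: int
    blk_just_right: int
    blk_unreferenced: int  # "Not In Any Collection"
    col_missing: int
    col_under: int
    col_over: int
    col_just_right: int


def inconsistencies(s: RunSummary) -> list:
    """Return a list of human-readable problems; empty if the sums agree."""
    problems = []
    referenced = s.blk_missing + s.blk_under + s.blk_over + s.blk_just_right
    if referenced != s.blocks_in_collections:
        problems.append(f"referenced blocks: {referenced} != {s.blocks_in_collections}")
    # Assumption: missing blocks are the only referenced blocks not stored in Keep.
    stored = s.blk_under + s.blk_over + s.blk_just_right + s.blk_unreferenced
    if stored != s.blocks_in_keep:
        problems.append(f"stored blocks: {stored} != {s.blocks_in_keep}")
    cols = s.col_missing + s.col_under + s.col_over + s.col_just_right
    if cols != s.collections:
        problems.append(f"collections: {cols} != {s.collections}")
    return problems


# Values taken from the 2016/03/24 and 2016/04/04 reports above.
MARCH = RunSummary(417, 514668, 961866, 0, 0, 1650, 513018, 447198, 0, 0, 11, 406)
APRIL = RunSummary(421, 782548, 716788, 65760, 0, 41180, 675608, 0, 3, 0, 13, 405)

if __name__ == "__main__":
    for name, run in (("March", MARCH), ("April", APRIL)):
        print(name, inconsistencies(run) or "sums agree")
```

Under these assumptions both reports are internally consistent, so the "missing" count does not look like a simple arithmetic slip in the summary. What changes between the runs is that "Not In Any Collection" drops from 447198 to 0 while "Blocks In Collections" grows by 267880.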

The two collections that were uploaded to replace the two deleted ones should have been identical to them byte for byte, as the re-uploads were from the same files using exactly the same file list.

One question I have is whether there is a tool that can tell me which collections, and which files within them, have missing hashes. I think that I can easily modify some of my scripts for that purpose, and I would like to know if there is an existing tool that I can use as a double check.
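For illustration, the mapping from missing block hashes to files can be done from a collection's manifest text alone. The following is a minimal sketch, not an existing Arvados tool: it assumes the manifest text has already been fetched by other means, that the set of missing hashes is known, and that each manifest line has the usual shape of a stream name, then block locators (`md5+size` with optional hints), then `position:size:filename` tokens. The function name `files_with_missing_blocks` is hypothetical.

```python
import re

# A block locator: 32 hex digits (MD5), "+size", then optional "+hint" parts.
LOCATOR_RE = re.compile(r"^([0-9a-f]{32})\+(\d+)(\+\S+)*$")
# A file segment: offset within the stream, length, file name.
FILE_RE = re.compile(r"^(\d+):(\d+):(\S+)$")


def unescape(name):
    """Decode backslash + three octal digits escapes (e.g. \\040 for a space)."""
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), name)


def files_with_missing_blocks(manifest_text, missing_hashes):
    """Map each affected file path to the set of missing hashes it depends on."""
    affected = {}
    for line in manifest_text.splitlines():
        tokens = line.split(" ")
        if len(tokens) < 3:
            continue  # not a complete stream line
        stream = unescape(tokens[0])
        blocks = []  # (start offset, end offset, md5) within this stream
        offset = 0
        i = 1
        while i < len(tokens):
            m = LOCATOR_RE.match(tokens[i])
            if not m:
                break
            size = int(m.group(2))
            blocks.append((offset, offset + size, m.group(1)))
            offset += size
            i += 1
        for token in tokens[i:]:
            m = FILE_RE.match(token)
            if not m:
                continue
            start, length = int(m.group(1)), int(m.group(2))
            if length == 0:
                continue  # empty files need no blocks
            end = start + length
            path = stream + "/" + unescape(m.group(3))
            for b_start, b_end, digest in blocks:
                # The file segment depends on a block if their byte ranges overlap.
                if digest in missing_hashes and b_start < end and start < b_end:
                    affected.setdefault(path, set()).add(digest)
    return affected
```

Running this over every collection's manifest with the hashes reported as `{Requested:1 Actual:0}` would give a per-collection list of affected files, which could then be compared against the output of any other tool.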

The other question is whether I can run further consistency checks with Data Manager, for example verifying the hashes of the data blocks.
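To make the second question concrete: Keep block locators begin with the MD5 digest of the block content, so a block can be verified by re-hashing it. The sketch below is not a Data Manager feature. It assumes direct filesystem access to a Keep volume in which each block is stored as a file named after its 32-character MD5 hash (the directory layout is an assumption; the code simply walks the tree). The helper names `verify_block_file` and `find_corrupt_blocks` are hypothetical.

```python
import hashlib
import re
from pathlib import Path

HASH_NAME_RE = re.compile(r"[0-9a-f]{32}")


def verify_block_file(path, chunk_size=1 << 20):
    """Return True if the file's MD5 digest equals its own file name."""
    expected = Path(path).name
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large blocks are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected


def find_corrupt_blocks(volume_root):
    """Walk a volume directory and list block files whose content hash differs."""
    bad = []
    for p in sorted(Path(volume_root).rglob("*")):
        # Assumption: only files named exactly like an MD5 hash are blocks.
        if p.is_file() and HASH_NAME_RE.fullmatch(p.name):
            if not verify_block_file(p):
                bad.append(str(p))
    return bad
```

A check like this only detects corrupted content in blocks that are present; it says nothing about blocks that are absent from every volume, which is the case reported above.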


Files

160329_arvDiskFree.png (37.6 KB), Peter Grandi, 04/05/2016 08:23 AM
keep_block_to_file.py (941 Bytes), Peter Amstutz, 04/10/2016 08:26 PM
datamanager (8.41 MB), Peter Amstutz, 04/10/2016 08:27 PM
160404_arvDiskFreeNotes.png (50.2 KB), Peter Grandi, 04/11/2016 10:11 AM
2016-04-11T13_34_34Z_filescount.txt (12.5 KB), Peter Grandi, 04/11/2016 02:52 PM

Related issues

Related to Arvados - Idea #8724: [Keep] Block validation script (Resolved, Radhika Chippada, 03/16/2016)
Related to Arvados - Bug #8910: [SDK] arv-put should save manifest text on API error (Resolved, Lucas Di Pentima)
Related to Arvados - Feature #8993: arv-put: options for 3 modes of "resumption" (Closed, 04/14/2016)