[Data manager] Verbose reporting
In order to help diagnose Keep issues, data manager should have an option to provide detailed information about inconsistencies with blocks and collections:
- List each keep server that provided an index, and dump the contents of the index that was received from each one.
- List all blocks considered "not in any collection"
- List all blocks considered "missing", along with the UUID of every collection that references each missing block
- List all blocks considered "over replicated"
#2 Updated by Tom Clegg over 3 years ago
Sounds like a good feature. How about just writing a CSV report, with one line per locator:
Perhaps (later?) a separate option for a CSV report with one line per collection:
uuid,pdh,want,n=0,n=1,n=2,... where the n=2 column indicates #blocks in this collection that are currently at replication=2.
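As a rough illustration of the per-collection report described above, here is a minimal Python sketch. The function names (`collection_row`, `write_collection_report`) and the input shape (a list of per-block replication levels for each collection) are assumptions made for this example. They are not part of datamanager.

```python
import csv
import sys
from collections import Counter


def collection_row(uuid, pdh, want, block_replication, max_n):
    """Build one report row: uuid, pdh, want, then the block count at n=0..max_n.

    block_replication is a hypothetical input: one integer per block in the
    collection, giving that block's current replication level.
    """
    counts = Counter(block_replication)
    return [uuid, pdh, want] + [counts.get(n, 0) for n in range(max_n + 1)]


def write_collection_report(collections, out=sys.stdout):
    """Write the CSV report; collections is a list of (uuid, pdh, want, levels)."""
    # Size the n=... columns to the highest replication level actually seen.
    max_n = max((max(levels) for _, _, _, levels in collections if levels),
                default=0)
    writer = csv.writer(out)
    writer.writerow(["uuid", "pdh", "want"] + ["n=%d" % n for n in range(max_n + 1)])
    for uuid, pdh, want, levels in collections:
        writer.writerow(collection_row(uuid, pdh, want, levels, max_n))
```

With this layout, a collection that wants replication 2 and has any nonzero count in the n=0 or n=1 columns stands out immediately.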
#5 Updated by Peter Amstutz over 3 years ago
- If given a flag, data manager produces a "missing blocks" report at a specified location. This consists of pairs of [block, collection uuid].
- Write a Python script which consumes the missing blocks report and reports precisely which files within each collection are affected by missing blocks.
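The second step could be sketched roughly as follows. This is not the attached script. It assumes the collection's manifest text is already in hand and relies only on the manifest layout: each line is a stream name, then block locators of the form `md5+size[+hints]`, then file segments of the form `position:size:filename`. The function name `files_with_missing_blocks` is hypothetical.

```python
import re

# A block locator starts with a 32-hex-digit MD5 followed by "+<size>".
LOCATOR = re.compile(r'^([0-9a-f]{32})\+(\d+)')
# A file segment token is "<position>:<size>:<filename>".
FILE_TOKEN = re.compile(r'^(\d+):(\d+):(\S+)$')


def files_with_missing_blocks(manifest_text, missing_hashes):
    """Return sorted (file path, block hash) pairs for files touching a missing block."""
    affected = set()
    for line in manifest_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        stream = tokens[0]
        blocks = []  # (start offset, end offset, md5) within this stream
        offset = 0
        for tok in tokens[1:]:
            loc = LOCATOR.match(tok)
            if loc:
                size = int(loc.group(2))
                blocks.append((offset, offset + size, loc.group(1)))
                offset += size
                continue
            seg = FILE_TOKEN.match(tok)
            if not seg:
                continue
            pos, length, name = int(seg.group(1)), int(seg.group(2)), seg.group(3)
            if length == 0:
                continue  # empty files do not depend on any block
            for start, end, md5 in blocks:
                # The segment [pos, pos+length) overlaps the block [start, end).
                if md5 in missing_hashes and start < pos + length and pos < end:
                    affected.add((stream + "/" + name, md5))
    return sorted(affected)
```

File names are reported exactly as they appear in the manifest, so escaped characters (for example `\040` for a space) are left undecoded in this sketch.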
#6 Updated by Peter Amstutz over 3 years ago
Copied from #8878:
To help you recover while we continue to try and diagnose the underlying bug, I've added some additional reporting to datamanager, along with an auxiliary python script. These are in the 8912-missing-blocks-report branch and for your convenience I've attached a binary of datamanager and a copy of the python script keep_block_to_file.py to this ticket.
datamanager -dry-run -extra-reports will produce some timestamped files, the formats are
timestamp_uuid_missing.txt. The former is the indexes returned by each keepstore to datamanager, the latter is the collections with missing blocks.
You then use keep_block_to_file.py *_missing.txt to get the list of specific files within each collection which have missing blocks.
#9 Updated by Tom Clegg over 3 years ago
Looking at 8912-missing-blocks-report at 48bafad...
datamanager flag description should mention that it will create (multiple) log files in CWD in each iteration. Calling it "-debug-extra-logs" might also be a helpful signal that it will create a bit of a mess.
The proposed "missing blocks" file format (e.g., "collection_uuid,missing_block\n") still seems better to me than the "separate file for each collection" approach implemented here. For example, it would avoid the issue of creating (potentially) thousands of log files at a time if storage volumes are down, and it would make it possible to use keep_block_to_file in a unix pipeline (e.g., "zcat report.gz | py") instead of relying on regexing the report filename to get the collection uuids.
Same goes for LogKeepIndex: just one file for the entire index at time X would be less sprawl, and you can always cut/grep later if you want separate files for some reason.
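To make the single-file suggestion concrete, a consumer of a "collection_uuid,missing_block" report might look something like this sketch. The function name `read_missing_report` is hypothetical, and the two-column layout is assumed from the format proposed above.

```python
import csv
from collections import defaultdict


def read_missing_report(stream):
    """Group missing block locators by collection UUID.

    stream is any iterable of text lines (e.g. sys.stdin), each expected to be
    "collection_uuid,missing_block".
    """
    missing = defaultdict(set)
    for row in csv.reader(stream):
        if len(row) != 2:
            continue  # skip blank or malformed lines rather than nesting else branches
        collection_uuid, block = row
        missing[collection_uuid].add(block)
    return missing
```

Passing `sys.stdin` as the stream lets the script sit at the end of a pipeline, so the collection UUIDs come from the report contents and not from the report file names.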
Please use "continue" in error handling, instead of "else" pyramids.
Python script should have some sort of usage comment.
Python script output should be machine-readable. Perhaps
    import csv
    import sys
    writer = csv.writer(sys.stdout)
    if st in missingblocks:
        writer.writerow([collection, name, st])