Project

General

Profile

Actions

Feature #16573

closed

[keep] deduplication reporting tool

Added by Ward Vandewege over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
-
Release relationship:
Auto

Description

Create a tool that, when given a list of 2 or more collection UUIDs, prints out a report that shows the aggregate nominal and the actual size of the collections, and indicates how much space is saved by Keep's deduplication.

Add this tool as a command to arvados-client.


Subtasks 1 (0 open1 closed)

Task #16579: Review branch 16573-keep-deduplication-reporting-toolResolvedWard Vandewege07/10/2020Actions

Related issues

Related to Arvados Epics - Idea #16514: Actionable insight into keep usageNewActions
Actions #1

Updated by Ward Vandewege over 3 years ago

  • Related to Idea #16514: Actionable insight into keep usage added
Actions #2

Updated by Ward Vandewege over 3 years ago

  • Subject changed from compare-collections tool to [keep] collection deduplication reporting tool
Actions #3

Updated by Ward Vandewege over 3 years ago

  • Subject changed from [keep] collection deduplication reporting tool to [keep] deduplication reporting tool
Actions #4

Updated by Ward Vandewege over 3 years ago

Ready for review at 2e0648fb2b8a006664e6225826d78916f682eff5 on branch 16573-keep-deduplication-reporting-tool. Tests are at developer-run-tests: #1952

Actions #5

Updated by Ward Vandewege over 3 years ago

  • Target version changed from 2020-07-01 Sprint to 2020-07-15
Actions #6

Updated by Tom Clegg over 3 years ago

Usage note: "nominal space used by the list of collection" → "list of collections" (or just "collections")

Usage example: instead of assuming server default page size is 100, how about adding --limit 100 and changing "tail -n100" to "tail -n+2"

deDuplicate: map lookups return the zero value when the key is not found, so with a bool map where you only store true values, you can replace if _, ok := seen[uuid]; !ok with just if !seen[uuid] (similar thing in the for _, v := range blocks loop in report())

if len(inputs) < 2 -- could this be < 1? If there's only one UUID after deduplication, I think it would be reasonable to output the simple answer rather than throwing an error (I'm guessing it would just work that way if you remove the error check)

If not... error message "different collections UUIDs" → "different collection UUIDs"

fmt.Println() blank line before the loop in report() seems unnecessary ... but if we keep it, it should be fmt.Fprintln(stdout)

Delete commented-out imports in report_test.go

The OverlappingCollections test cases appear nearly identical - maybe they could be combined into a parameterized test along the lines of TestPermission in source:services/arv-git-httpd/auth_handler_test.go

The manifests in the test cases appear to have hard-coded permission signatures -- signature ending in @5f0de808 will expire on Tue Jul 14 13:14:48 EDT 2020 which might cause tests to fail. Since NewClientFromEnv gives you an admin token for tests, can you leave off the +A signatures entirely?

Feature suggestion / scope creep: instead of relying on the user to provide PDHs for efficiency, the tool could start by doing a collection list with select=pdh, filters=[uuid,in,[...]] (which should always be fast), then deduplicate by pdh, then fetch the manifests.

Actions #7

Updated by Ward Vandewege over 3 years ago

Tom Clegg wrote:

Usage note: "nominal space used by the list of collection" → "list of collections" (or just "collections")

Fixed.

Usage example: instead of assuming server default page size is 100, how about adding --limit 100 and changing "tail -n100" to "tail -n+2"

Nice, fixed.

deDuplicate: map lookups return the zero value when the key is not found, so with a bool map where you only store true values, you can replace if _, ok := seen[uuid]; !ok with just if !seen[uuid] (similar thing in the for _, v := range blocks loop in report())

Oh, yes, fixed.

if len(inputs) < 2 -- could this be < 1? If there's only one UUID after deduplication, I think it would be reasonable to output the simple answer rather than throwing an error (I'm guessing it would just work that way if you remove the error check)

Yes, it does just work; I changed it to < 1 and adjusted the error message accordingly.

fmt.Println() blank line before the loop in report() seems unnecessary ... but if we keep it, it should be fmt.Fprintln(stdout)

Right! I dropped it.

Delete commented-out imports in report_test.go

Done.

The OverlappingCollections test cases appear nearly identical - maybe they could be combined into a parameterized test along the lines of TestPermission in source:services/arv-git-httpd/auth_handler_test.go

Done.

The manifests in the test cases appear to have hard-coded permission signatures -- signature ending in @5f0de808 will expire on Tue Jul 14 13:14:48 EDT 2020 which might cause tests to fail. Since NewClientFromEnv gives you an admin token for tests, can you leave off the +A signatures entirely?

Nice, done.

Feature suggestion / scope creep: instead of relying on the user to provide PDHs for efficiency, the tool could start by doing a collection list with select=pdh, filters=[uuid,in,[...]] (which should always be fast), then deduplicate by pdh, then fetch the manifests.

Yeah; if it's OK I'll leave that as a future improvement.

All fixes at 544d7aae0d58a25e8c761c638167c3564de06af5 on branch 16573-keep-deduplication-reporting-tool.

Actions #8

Updated by Tom Clegg over 3 years ago

Just a couple of nits

logger.Errorf("...\n") doesn't need "\n" (unlike fmt.Fprintf())

might as well delete this:

+       //c.Check(stdout.String(), check.Equals, "")

Rest LGTM, thanks!

Actions #9

Updated by Ward Vandewege over 3 years ago

  • Status changed from In Progress to Resolved

I've made those changes and merged, thanks.

Actions #10

Updated by Ward Vandewege over 3 years ago

  • Release set to 25
Actions

Also available in: Atom PDF