Feature #16573

[keep] deduplication reporting tool

Added by Ward Vandewege over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
07/10/2020
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

Create a tool that, when given a list of 2 or more collection UUIDs, prints out a report that shows the aggregate nominal and the actual size of the collections, and indicates how much space is saved by Keep's deduplication.

Add this tool as a command to arvados-client.


Subtasks

Task #16579: Review branch 16573-keep-deduplication-reporting-toolResolvedWard Vandewege


Related issues

Related to Arvados Epics - Story #16514: Actionable insight into keep usageNew06/01/202209/30/2022

Associated revisions

Revision 28607876
Added by Ward Vandewege over 1 year ago

Merge branch '16573-keep-deduplication-reporting-tool'

closes #16573

Arvados-DCO-1.1-Signed-off-by: Ward Vandewege <>

History

#1 Updated by Ward Vandewege over 1 year ago

  • Related to Story #16514: Actionable insight into keep usage added

#2 Updated by Ward Vandewege over 1 year ago

  • Subject changed from compare-collections tool to [keep] collection deduplication reporting tool

#3 Updated by Ward Vandewege over 1 year ago

  • Subject changed from [keep] collection deduplication reporting tool to [keep] deduplication reporting tool

#4 Updated by Ward Vandewege over 1 year ago

Ready for review at 2e0648fb2b8a006664e6225826d78916f682eff5 on branch 16573-keep-deduplication-reporting-tool. Tests are at https://ci.arvados.org/view/Developer/job/developer-run-tests/1952/

#5 Updated by Ward Vandewege over 1 year ago

  • Target version changed from 2020-07-01 Sprint to 2020-07-15

#6 Updated by Tom Clegg over 1 year ago

Usage note: "nominal space used by the list of collection" → "list of collections" (or just "collections")

Usage example: instead of assuming server default page size is 100, how about adding --limit 100 and changing "tail -n100" to "tail -n+2"

deDuplicate: map lookups return the zero value when the key is not found, so with a bool map where you only store true values, you can replace if _, ok := seen[uuid]; !ok with just if !seen[uuid] (similar thing in the for _, v := range blocks loop in report())

if len(inputs) < 2 -- could this be < 1? If there's only one UUID after deduplication, I think it would be reasonable to output the simple answer rather than throwing an error (I'm guessing it would just work that way if you remove the error check)

If not... error message "different collections UUIDs" → "different collection UUIDs"

fmt.Println() blank line before the loop in report() seems unnecessary ... but if we keep it, it should be fmt.Fprintln(stdout)

Delete commented-out imports in report_test.go

The OverlappingCollections test cases appear nearly identical - maybe they could be combined into a parameterized test along the lines of TestPermission in source:services/arv-git-httpd/auth_handler_test.go

The manifests in the test cases appear to have hard-coded permission signatures -- signature ending in @5f0de808 will expire on Tue Jul 14 13:14:48 EDT 2020 which might cause tests to fail. Since NewClientFromEnv gives you an admin token for tests, can you leave off the +A signatures entirely?

Feature suggestion / scope creep: instead of relying on the user to provide PDHs for efficiency, the tool could start by doing a collection list with select=pdh, filters=[uuid,in,[...]] (which should always be fast), then deduplicate by pdh, then fetch the manifests.

#7 Updated by Ward Vandewege over 1 year ago

Tom Clegg wrote:

Usage note: "nominal space used by the list of collection" → "list of collections" (or just "collections")

Fixed.

Usage example: instead of assuming server default page size is 100, how about adding --limit 100 and changing "tail -n100" to "tail -n+2"

Nice, fixed.

deDuplicate: map lookups return the zero value when the key is not found, so with a bool map where you only store true values, you can replace if _, ok := seen[uuid]; !ok with just if !seen[uuid] (similar thing in the for _, v := range blocks loop in report())

Oh, yes, fixed.

if len(inputs) < 2 -- could this be < 1? If there's only one UUID after deduplication, I think it would be reasonable to output the simple answer rather than throwing an error (I'm guessing it would just work that way if you remove the error check)

Yes, it does just work; I changed it to < 1 and adjusted the error message accordingly.

fmt.Println() blank line before the loop in report() seems unnecessary ... but if we keep it, it should be fmt.Fprintln(stdout)

Right! I dropped it.

Delete commented-out imports in report_test.go

Done.

The OverlappingCollections test cases appear nearly identical - maybe they could be combined into a parameterized test along the lines of TestPermission in source:services/arv-git-httpd/auth_handler_test.go

Done.

The manifests in the test cases appear to have hard-coded permission signatures -- signature ending in @5f0de808 will expire on Tue Jul 14 13:14:48 EDT 2020 which might cause tests to fail. Since NewClientFromEnv gives you an admin token for tests, can you leave off the +A signatures entirely?

Nice, done.

Feature suggestion / scope creep: instead of relying on the user to provide PDHs for efficiency, the tool could start by doing a collection list with select=pdh, filters=[uuid,in,[...]] (which should always be fast), then deduplicate by pdh, then fetch the manifests.

Yeah; if it's OK I'll leave that as a future improvement.

All fixes at 544d7aae0d58a25e8c761c638167c3564de06af5 on branch 16573-keep-deduplication-reporting-tool.

#8 Updated by Tom Clegg over 1 year ago

Just a couple of nits

logger.Errorf("...\n") doesn't need "\n" (unlike fmt.Fprintf())

might as well delete this:

+       //c.Check(stdout.String(), check.Equals, "")

Rest LGTM, thanks!

#9 Updated by Ward Vandewege over 1 year ago

  • Status changed from In Progress to Resolved

I've made those changes and merged, thanks.

#10 Updated by Ward Vandewege about 1 year ago

  • Release set to 25

Also available in: Atom PDF