Idea #16427: "undelete" command to recover trashed blocks and restore a deleted collection - Arvados

Custom queries

All assigned issues
All issues assigned for grooming
My issues for grooming
My issues for grooming (no story pts)
Prioritized open issues

Actions

Copy link

Idea #16427

closed

"undelete" command to recover trashed blocks and restore a deleted collection

Added by Tom Clegg over 4 years ago. Updated about 4 years ago.

Status:

Resolved

Priority:

Normal

Assigned To:

Tom Clegg

Category:

Target version:

2020-06-17 Sprint

Start date:

06/01/2020

Due date:

Story points:

3.0

Release:

Arvados 2.1.0

Release relationship:

Auto

Description

In some cases, even when a collection has been deleted and keep-balance has trashed its data blocks, the collection is still partially or fully recoverable:

some blocks may not have been trashed because they are still referenced by other collections
some blocks may still be in the recoverable "trashed" state, if BlobTrashLifetime has not arrived yet

Given a manifest (or the UUID of a a collection update/delete log entry that has a manifest in old_attributes), the recovery command ("arvados-server undelete") should use keepstore's HEAD and untrash APIs to untrash the data as needed, make a new manifest with fresh signatures, and save a new collection.

If a block is not recoverable, the command should continue to untrash as many as it can, and report how many were attempted/successful, but not save a new collection. (Future improvement: save a partial collection in this case, omitting any files affected by the missing blocks.)

Files

16427-doc.png (205 KB) 16427-doc.png

Tom Clegg, 06/05/2020 09:16 PM

Subtasks 1 (0 open — 1 closed)

Task #16454: Review 16427-undelete

Resolved

Tom Clegg

06/01/2020

Actions

Related issues

	Related to Arvados - Support #16421: [doc] document deletion lifecycle of collections, and steps to undelete collections	Resolved	Ward Vandewege	09/02/2020			Actions
	Related to Arvados Epics - Idea #16514: Actionable insight into keep usage	New					Actions

Issue # Delay: days Cancel

History
Notes
Property changes
Associated revisions

Actions

Copy link

Updated by Tom Clegg over 4 years ago

Story points set to 3.0

Actions

Copy link

Updated by Peter Amstutz over 4 years ago

Target version set to 2020-06-03 Sprint

Actions

Copy link

Updated by Tom Clegg over 4 years ago

Description updated (diff)

Actions

Copy link

Updated by Tom Clegg over 4 years ago

Status changed from New to In Progress

work in progress at 16427-undelete

Actions

Copy link

Updated by Peter Amstutz over 4 years ago

Assigned To set to Tom Clegg

Actions

Copy link

Updated by Ward Vandewege over 4 years ago

Related to Support #16421: [doc] document deletion lifecycle of collections, and steps to undelete collections added

Actions

Copy link

Updated by Tom Clegg over 4 years ago

16427-undelete @ 167e2537365ec68fe6be6a330a2eb698f177aa05 -- developer-run-tests: #1879

TODO: accept UUID of collection-delete/update log entry (currently must provide file containing manifest text)
TODO: doc page

Actions

Copy link

Updated by Ward Vandewege over 4 years ago

Review comments:

lib/undelete/cmd.go

+ func (und undeleter) newestMtime(logger logrus.FieldLogger, blk string, svc arvados.KeepService) (time.Time, bool) {

It would be nice to follow the Go error pattern instead, and make the second arg an 'error'. Presumably, block not on svc is an error too!

+func (und undeleter) ensureSafe(ctx context.Context, logger logrus.FieldLogger, blk string, svc arvados.KeepService, blobsigttl time.Duration, blobsigexp time.Time) bool {

Same - why not return an error instead of a bool?

+func (und undeleter) RecoverManifest(mtxt string) (string, error) {

Is there a particular reason why half a BlobSigningTTL is a good choice for the save deadline? Can the reason go into a comment?

+ for i := 0; i < 2*len(services); i++ {

It would be great to make it a bit more clear that these are worker threads, and that the number of threads is being sized by the number of keepstore nodes (rather than the cores on the client, or something). Maybe something like

"workerThreads := 2 * len(services)"

+ if havenot > 0 {
+ if have > 0 {
+ und.logger.Warn("partial recovery is not implemented")
+ }
+ return "", fmt.Errorf("unable to recover %d of %d blocks", havenot, have+havenot)
+ }

Partial unrecovery will be nice. Just listing the affected blocks, could also be nice. Blocks + affected filenames even better. That should probably be part of the partial recovery feature.

lib/undelete/cmd_test.go

+func (*Suite) TestUntrashAndTouchBlock(c *check.C) {

+ if fi, err := os.Stat(datadir); err != nil || !fi.IsDir() {
+ c.Logf("skipping datadir %q, evidently neither keepstore is using it", datadir)
+ continue
+ }

This seems like it could generate unnecessary noise in the tests, no? Can we make this debug-only, or even just skip the logging altogether? It doesn't seem super useful.

Functionally: running arvados-server with the `undelete` command prints this, which seems unhelpful:

$ ./arvados-server undelete
Usage: ./arvados-server undelete [options] uuid_or_file ...
-config file
Site configuration file (default may be overridden by setting an ARVADOS_CONFIG environment variable) (default "/etc/arvados/config.yml")
-legacy-crunch-dispatch-slurm-config file
Legacy crunch-dispatch-slurm configuration file (default "/etc/arvados/crunch-dispatch-slurm/crunch-dispatch-slurm.yml")
-legacy-git-httpd-config file
Legacy arv-git-httpd configuration file (default "/etc/arvados/git-httpd/git-httpd.yml")
-legacy-keepbalance-config file
Legacy keep-balance configuration file (default "/etc/arvados/keep-balance/keep-balance.yml")
-legacy-keepproxy-config file
Legacy keepproxy configuration file (default "/etc/arvados/keepproxy/keepproxy.yml")
-legacy-keepstore-config file
Legacy keepstore configuration file (default "/etc/arvados/keepstore/keepstore.yml")
-legacy-keepweb-config file
Legacy keep-web configuration file (default "/etc/arvados/keep-web/keep-web.yml")
-legacy-ws-config file
Legacy arvados-ws configuration file (default "/etc/arvados/ws/ws.yml")
-log-level string
logging level (debug, info, ...) (default "info")
-skip-legacy
Don't load legacy config files
INFO[2020-06-03T09:59:49.409731940-04:00] exiting

Can we have some help text?

Otherwise, LGTM so far!

Actions

Copy link

#10

Updated by Tom Clegg over 4 years ago

Target version changed from 2020-06-03 Sprint to 2020-06-17 Sprint

Actions

Copy link

#11

Updated by Tom Clegg over 4 years ago

Ward Vandewege wrote:

It would be nice to follow the Go error pattern instead, and make the second arg an 'error'. Presumably, block not on svc is an error too!

Not exactly (we might go on to find it on a different server), but yes, I've updated both funcs to return error instead of an ok bool.

Is there a particular reason why half a BlobSigningTTL is a good choice for the save deadline? Can the reason go into a comment?

Yes, added comment.

+ for i := 0; i < 2*len(services); i++ {

Added explanatory comment for this too.

Partial unrecovery will be nice. Just listing the affected blocks, could also be nice. Blocks + affected filenames even better. That should probably be part of the partial recovery feature.

Yeah, I think the first "partial recovery" feature would remove any files that are affected by the missing blocks, and save the files that are still intact, and exit non-zero.

Manifests don't support sparse files, so our only options for saving partial files seem to be zero-padding and truncating, so if you start with "abcdefghi" and "def" goes missing, you can recover either "abcghi" or "abc\0\0\0ghi". (I don't think we're going to implement either right away though.)

This seems like it could generate unnecessary noise in the tests, no? Can we make this debug-only, or even just skip the logging altogether? It doesn't seem super useful.

Yes, it only shows up when the test fails or you run -check.vv, but I changed it to print the datadirs that are used instead of the ones that aren't, which is pretty much the same information but less mysterious.

Functionally: running arvados-server with the `undelete` command prints this, which seems unhelpful:

Removed the legacy config flags, added more explanation, and made the "Usage: ..." part appear in the "bad args" case as well as the "no positional args" case.

Usage:  
        ./tmp/GOPATH/bin/arvados-server undelete [options ...] /path/to/manifest.txt [...]

        This program recovers deleted collections. Recovery is
        possible when the collection's manifest is still available and
        all of its data blocks are still available or recoverable
        (e.g., garbage collection is not enabled, the blocks are too
        new for garbage collection, the blocks are referenced by other
        collections, or the blocks have been trashed but not yet
        deleted).

        For each provided collection manifest, once all data blocks
        are recovered/protected from garbage collection, a new
        collection is saved and its UUID is printed on stdout.

        Exit status will be zero if recovery is successful, i.e., a
        collection is saved for each provided manifest.
Options:
  -config file
        Site configuration file (default may be overridden by setting an ARVADOS_CONFIG environment variable) (default "/etc/arvados/config.yml")
  -log-level string
        logging level (debug, info, ...) (default "info")

16427-undelete @ f525d01de971b8b06dabc827bc0bbf46ee7ce9cc -- developer-run-tests: #1892

and with master merged:

16427-undelete @ 2be95ce7f069d9eb131b0d2d922a5e556f75810c -- developer-run-tests: #1893

Actions

Copy link

#12

Updated by Ward Vandewege over 4 years ago

Tom Clegg wrote:

Ward Vandewege wrote:

It would be nice to follow the Go error pattern instead, and make the second arg an 'error'. Presumably, block not on svc is an error too!

Not exactly (we might go on to find it on a different server), but yes, I've updated both funcs to return error instead of an ok bool.

Cool. Looks like the comments for both functions still needs updating.

Partial unrecovery will be nice. Just listing the affected blocks, could also be nice. Blocks + affected filenames even better. That should probably be part of the partial recovery feature.

Yeah, I think the first "partial recovery" feature would remove any files that are affected by the missing blocks, and save the files that are still intact, and exit non-zero.

Yes. That would be very useful already.

Manifests don't support sparse files, so our only options for saving partial files seem to be zero-padding and truncating, so if you start with "abcdefghi" and "def" goes missing, you can recover either "abcghi" or "abc\0\0\0ghi". (I don't think we're going to implement either right away though.)

I would probably prefer the zero padding approach (because it shows exactly what sections are missing). I agree that this could be a follow on story.

This seems like it could generate unnecessary noise in the tests, no? Can we make this debug-only, or even just skip the logging altogether? It doesn't seem super useful.

Yes, it only shows up when the test fails or you run -check.vv, but I changed it to print the datadirs that are used instead of the ones that aren't, which is pretty much the same information but less mysterious.

Perfect, thanks.

Functionally: running arvados-server with the `undelete` command prints this, which seems unhelpful:

Removed the legacy config flags, added more explanation, and made the "Usage: ..." part appear in the "bad args" case as well as the "no positional args" case.

Excellent thank you.

LGTM with those function comments updated. Thanks!

Actions

Copy link

#13

Updated by Tom Clegg over 4 years ago

File 16427-doc.png 16427-doc.png added

This branch also adds the "get old manifest from log entry UUID" feature.

16427-undelete @ 570c793f1aa43fd9763a0368554dd395dacdd238 -- developer-run-tests: #1897

Should we also add "provide collection UUID -> get old version from most recent log entry concerning that collection" ...? That would probably be the most convenient way to do the most obvious case, i.e., undo the last change, whether or not it was a "delete".

Actions

Copy link

#14

Updated by Ward Vandewege over 4 years ago

Tom Clegg wrote:

Nice! I like the documentation.

This branch also adds the "get old manifest from log entry UUID" feature.

16427-undelete @ 570c793f1aa43fd9763a0368554dd395dacdd238 -- developer-run-tests: #1897

Yeah that LGTM. The test failure looks unrelated, do you agree?

Should we also add "provide collection UUID -> get old version from most recent log entry concerning that collection" ...? That would probably be the most convenient way to do the most obvious case, i.e., undo the last change, whether or not it was a "delete".

Yes, that would be nice. Should we call that 'undo' instead of 'undelete' ?

Actions

Copy link

#15

Updated by Tom Clegg over 4 years ago

Should we also add "provide collection UUID -> get old version from most recent log entry concerning that collection" ...? That would probably be the most convenient way to do the most obvious case, i.e., undo the last change, whether or not it was a "delete".

Yes, that would be nice. Should we call that 'undo' instead of 'undelete' ?

I think "undo" isn't quite right because we're (intentionally) saving the recovered data elsewhere, rather than restoring the same collection UUID back to an earlier state. Same reasoning applies to calling the current functionality "undelete" for that matter -- it doesn't actually make the collection UUID come back, there's no attempt to restore name/metadata, etc. Would "recover-collection-data" be better, and make sense whether passing a log uuid, collection uuid, or manifest saved in a file? It's a bit unwieldy/verbose but I'm thinking that's ok for something only used in rare/unusual cases.

Actions

Copy link

#16

Updated by Ward Vandewege over 4 years ago

Tom Clegg wrote:

Should we also add "provide collection UUID -> get old version from most recent log entry concerning that collection" ...? That would probably be the most convenient way to do the most obvious case, i.e., undo the last change, whether or not it was a "delete".

Yes, that would be nice. Should we call that 'undo' instead of 'undelete' ?

I think "undo" isn't quite right because we're (intentionally) saving the recovered data elsewhere, rather than restoring the same collection UUID back to an earlier state. Same reasoning applies to calling the current functionality "undelete" for that matter -- it doesn't actually make the collection UUID come back, there's no attempt to restore name/metadata, etc. Would "recover-collection-data" be better, and make sense whether passing a log uuid, collection uuid, or manifest saved in a file? It's a bit unwieldy/verbose but I'm thinking that's ok for something only used in rare/unusual cases.

Yes, good call. You could shorten it to recover-collection if you want.

Actions

Copy link

#17

Updated by Peter Amstutz over 4 years ago

Related to Idea #16514: Actionable insight into keep usage added

Actions

Copy link

#18

Updated by Tom Clegg over 4 years ago

16427-undelete @ e48478841828b1dbab8b69eb9453db23b42ed63f -- developer-run-tests: #1913

rename subcommand to "recover-collection"
accept collection UUID (equivalent to passing UUID of most recent log entry with matching object_uuid)

Actions

Copy link

#19

Updated by Ward Vandewege over 4 years ago

Tom Clegg wrote:

16427-undelete @ e48478841828b1dbab8b69eb9453db23b42ed63f -- developer-run-tests: #1913

rename subcommand to "recover-collection"

accept collection UUID (equivalent to passing UUID of most recent log entry with matching object_uuid)

LGTM thanks!

Actions

Copy link

#20

Updated by Anonymous over 4 years ago

Status changed from In Progress to Resolved

Applied in changeset arvados|9447d2f3bd4c15a4b3bda16dc3ddef0faeb68771.

Actions

Copy link

#21

Updated by Peter Amstutz about 4 years ago

Release set to 25

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Arvados

Custom queries

Idea #16427

"undelete" command to recover trashed blocks and restore a deleted collection

Updated by Tom Clegg over 4 years ago

Updated by Peter Amstutz over 4 years ago

Updated by Tom Clegg over 4 years ago

Updated by Tom Clegg over 4 years ago

Updated by Peter Amstutz over 4 years ago

Updated by Ward Vandewege over 4 years ago

Updated by Tom Clegg over 4 years ago

Updated by Ward Vandewege over 4 years ago

Updated by Tom Clegg over 4 years ago

Updated by Tom Clegg over 4 years ago

Updated by Ward Vandewege over 4 years ago

Updated by Tom Clegg over 4 years ago

Updated by Ward Vandewege over 4 years ago

Updated by Tom Clegg over 4 years ago

Updated by Ward Vandewege over 4 years ago

Updated by Peter Amstutz over 4 years ago

Updated by Tom Clegg over 4 years ago

Updated by Ward Vandewege over 4 years ago

Updated by Anonymous over 4 years ago

Updated by Peter Amstutz about 4 years ago