Story #16427

"undelete" command to recover trashed blocks and restore a deleted collection

Added by Tom Clegg 10 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
06/01/2020
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
3.0
Release relationship:
Auto

Description

In some cases, even when a collection has been deleted and keep-balance has trashed its data blocks, the collection is still partially or fully recoverable:
  • some blocks may not have been trashed because they are still referenced by other collections
  • some blocks may still be in the recoverable "trashed" state, if BlobTrashLifetime has not arrived yet

Given a manifest (or the UUID of a a collection update/delete log entry that has a manifest in old_attributes), the recovery command ("arvados-server undelete") should use keepstore's HEAD and untrash APIs to untrash the data as needed, make a new manifest with fresh signatures, and save a new collection.

If a block is not recoverable, the command should continue to untrash as many as it can, and report how many were attempted/successful, but not save a new collection. (Future improvement: save a partial collection in this case, omitting any files affected by the missing blocks.)

16427-doc.png (205 KB) 16427-doc.png Tom Clegg, 06/05/2020 09:16 PM

Subtasks

Task #16454: Review 16427-undeleteResolvedTom Clegg


Related issues

Related to Arvados - Support #16421: [doc] document deletion lifecycle of collections, and steps to undelete collectionsResolved09/02/2020

Related to Arvados Epics - Story #16514: Actionable insight into keep usageNew09/01/202112/31/2021

Associated revisions

Revision bcf8d387
Added by Tom Clegg 9 months ago

Merge branch '16427-undelete'

refs #16427

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision 9447d2f3
Added by Tom Clegg 9 months ago

Merge branch '16427-undelete'

closes #16427

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg 10 months ago

  • Story points set to 3.0

#2 Updated by Peter Amstutz 10 months ago

  • Target version set to 2020-06-03 Sprint

#4 Updated by Tom Clegg 10 months ago

  • Description updated (diff)

#5 Updated by Tom Clegg 10 months ago

  • Status changed from New to In Progress

work in progress at 16427-undelete

#6 Updated by Peter Amstutz 9 months ago

  • Assigned To set to Tom Clegg

#7 Updated by Ward Vandewege 9 months ago

  • Related to Support #16421: [doc] document deletion lifecycle of collections, and steps to undelete collections added

#8 Updated by Tom Clegg 9 months ago

16427-undelete @ 167e2537365ec68fe6be6a330a2eb698f177aa05 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1879/

TODO: accept UUID of collection-delete/update log entry (currently must provide file containing manifest text)
TODO: doc page

#9 Updated by Ward Vandewege 9 months ago

Review comments:

lib/undelete/cmd.go

+ func (und undeleter) newestMtime(logger logrus.FieldLogger, blk string, svc arvados.KeepService) (time.Time, bool) {

It would be nice to follow the Go error pattern instead, and make the second arg an 'error'. Presumably, block not on svc is an error too!

+func (und undeleter) ensureSafe(ctx context.Context, logger logrus.FieldLogger, blk string, svc arvados.KeepService, blobsigttl time.Duration, blobsigexp time.Time) bool {

Same - why not return an error instead of a bool?

+func (und undeleter) RecoverManifest(mtxt string) (string, error) {

Is there a particular reason why half a BlobSigningTTL is a good choice for the save deadline? Can the reason go into a comment?

+ for i := 0; i < 2*len(services); i++ {

It would be great to make it a bit more clear that these are worker threads, and that the number of threads is being sized by the number of keepstore nodes (rather than the cores on the client, or something). Maybe something like

"workerThreads := 2 * len(services)"

+ if havenot > 0 {
+ if have > 0 {
+ und.logger.Warn("partial recovery is not implemented")
+ }
+ return "", fmt.Errorf("unable to recover %d of %d blocks", havenot, have+havenot)
+ }

Partial unrecovery will be nice. Just listing the affected blocks, could also be nice. Blocks + affected filenames even better. That should probably be part of the partial recovery feature.

lib/undelete/cmd_test.go

+func (*Suite) TestUntrashAndTouchBlock(c *check.C) {

+ if fi, err := os.Stat(datadir); err != nil || !fi.IsDir() {
+ c.Logf("skipping datadir %q, evidently neither keepstore is using it", datadir)
+ continue
+ }

This seems like it could generate unnecessary noise in the tests, no? Can we make this debug-only, or even just skip the logging altogether? It doesn't seem super useful.

Functionally: running arvados-server with the `undelete` command prints this, which seems unhelpful:

$ ./arvados-server undelete
Usage: ./arvados-server undelete [options] uuid_or_file ...
-config file
Site configuration file (default may be overridden by setting an ARVADOS_CONFIG environment variable) (default "/etc/arvados/config.yml")
-legacy-crunch-dispatch-slurm-config file
Legacy crunch-dispatch-slurm configuration file (default "/etc/arvados/crunch-dispatch-slurm/crunch-dispatch-slurm.yml")
-legacy-git-httpd-config file
Legacy arv-git-httpd configuration file (default "/etc/arvados/git-httpd/git-httpd.yml")
-legacy-keepbalance-config file
Legacy keep-balance configuration file (default "/etc/arvados/keep-balance/keep-balance.yml")
-legacy-keepproxy-config file
Legacy keepproxy configuration file (default "/etc/arvados/keepproxy/keepproxy.yml")
-legacy-keepstore-config file
Legacy keepstore configuration file (default "/etc/arvados/keepstore/keepstore.yml")
-legacy-keepweb-config file
Legacy keep-web configuration file (default "/etc/arvados/keep-web/keep-web.yml")
-legacy-ws-config file
Legacy arvados-ws configuration file (default "/etc/arvados/ws/ws.yml")
-log-level string
logging level (debug, info, ...) (default "info")
-skip-legacy
Don't load legacy config files
INFO[2020-06-03T09:59:49.409731940-04:00] exiting

Can we have some help text?

Otherwise, LGTM so far!

#10 Updated by Tom Clegg 9 months ago

  • Target version changed from 2020-06-03 Sprint to 2020-06-17 Sprint

#11 Updated by Tom Clegg 9 months ago

Ward Vandewege wrote:

It would be nice to follow the Go error pattern instead, and make the second arg an 'error'. Presumably, block not on svc is an error too!

Not exactly (we might go on to find it on a different server), but yes, I've updated both funcs to return error instead of an ok bool.

Is there a particular reason why half a BlobSigningTTL is a good choice for the save deadline? Can the reason go into a comment?

Yes, added comment.

+ for i := 0; i < 2*len(services); i++ {

Added explanatory comment for this too.

Partial unrecovery will be nice. Just listing the affected blocks, could also be nice. Blocks + affected filenames even better. That should probably be part of the partial recovery feature.

Yeah, I think the first "partial recovery" feature would remove any files that are affected by the missing blocks, and save the files that are still intact, and exit non-zero.

Manifests don't support sparse files, so our only options for saving partial files seem to be zero-padding and truncating, so if you start with "abcdefghi" and "def" goes missing, you can recover either "abcghi" or "abc\0\0\0ghi". (I don't think we're going to implement either right away though.)

This seems like it could generate unnecessary noise in the tests, no? Can we make this debug-only, or even just skip the logging altogether? It doesn't seem super useful.

Yes, it only shows up when the test fails or you run -check.vv, but I changed it to print the datadirs that are used instead of the ones that aren't, which is pretty much the same information but less mysterious.

Functionally: running arvados-server with the `undelete` command prints this, which seems unhelpful:

Removed the legacy config flags, added more explanation, and made the "Usage: ..." part appear in the "bad args" case as well as the "no positional args" case.

Usage:  
        ./tmp/GOPATH/bin/arvados-server undelete [options ...] /path/to/manifest.txt [...]

        This program recovers deleted collections. Recovery is
        possible when the collection's manifest is still available and
        all of its data blocks are still available or recoverable
        (e.g., garbage collection is not enabled, the blocks are too
        new for garbage collection, the blocks are referenced by other
        collections, or the blocks have been trashed but not yet
        deleted).

        For each provided collection manifest, once all data blocks
        are recovered/protected from garbage collection, a new
        collection is saved and its UUID is printed on stdout.

        Exit status will be zero if recovery is successful, i.e., a
        collection is saved for each provided manifest.
Options:
  -config file
        Site configuration file (default may be overridden by setting an ARVADOS_CONFIG environment variable) (default "/etc/arvados/config.yml")
  -log-level string
        logging level (debug, info, ...) (default "info")

16427-undelete @ f525d01de971b8b06dabc827bc0bbf46ee7ce9cc -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1892/

and with master merged:

16427-undelete @ 2be95ce7f069d9eb131b0d2d922a5e556f75810c -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1893/

#12 Updated by Ward Vandewege 9 months ago

Tom Clegg wrote:

Ward Vandewege wrote:

It would be nice to follow the Go error pattern instead, and make the second arg an 'error'. Presumably, block not on svc is an error too!

Not exactly (we might go on to find it on a different server), but yes, I've updated both funcs to return error instead of an ok bool.

Cool. Looks like the comments for both functions still needs updating.

Partial unrecovery will be nice. Just listing the affected blocks, could also be nice. Blocks + affected filenames even better. That should probably be part of the partial recovery feature.

Yeah, I think the first "partial recovery" feature would remove any files that are affected by the missing blocks, and save the files that are still intact, and exit non-zero.

Yes. That would be very useful already.

Manifests don't support sparse files, so our only options for saving partial files seem to be zero-padding and truncating, so if you start with "abcdefghi" and "def" goes missing, you can recover either "abcghi" or "abc\0\0\0ghi". (I don't think we're going to implement either right away though.)

I would probably prefer the zero padding approach (because it shows exactly what sections are missing). I agree that this could be a follow on story.

This seems like it could generate unnecessary noise in the tests, no? Can we make this debug-only, or even just skip the logging altogether? It doesn't seem super useful.

Yes, it only shows up when the test fails or you run -check.vv, but I changed it to print the datadirs that are used instead of the ones that aren't, which is pretty much the same information but less mysterious.

Perfect, thanks.

Functionally: running arvados-server with the `undelete` command prints this, which seems unhelpful:

Removed the legacy config flags, added more explanation, and made the "Usage: ..." part appear in the "bad args" case as well as the "no positional args" case.

Excellent thank you.

LGTM with those function comments updated. Thanks!

#13 Updated by Tom Clegg 9 months ago

This branch also adds the "get old manifest from log entry UUID" feature.

16427-undelete @ 570c793f1aa43fd9763a0368554dd395dacdd238 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1897/

Should we also add "provide collection UUID -> get old version from most recent log entry concerning that collection" ...? That would probably be the most convenient way to do the most obvious case, i.e., undo the last change, whether or not it was a "delete".

#14 Updated by Ward Vandewege 9 months ago

Tom Clegg wrote:

Nice! I like the documentation.

This branch also adds the "get old manifest from log entry UUID" feature.

16427-undelete @ 570c793f1aa43fd9763a0368554dd395dacdd238 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1897/

Yeah that LGTM. The test failure looks unrelated, do you agree?

Should we also add "provide collection UUID -> get old version from most recent log entry concerning that collection" ...? That would probably be the most convenient way to do the most obvious case, i.e., undo the last change, whether or not it was a "delete".

Yes, that would be nice. Should we call that 'undo' instead of 'undelete' ?

#15 Updated by Tom Clegg 9 months ago

Should we also add "provide collection UUID -> get old version from most recent log entry concerning that collection" ...? That would probably be the most convenient way to do the most obvious case, i.e., undo the last change, whether or not it was a "delete".

Yes, that would be nice. Should we call that 'undo' instead of 'undelete' ?

I think "undo" isn't quite right because we're (intentionally) saving the recovered data elsewhere, rather than restoring the same collection UUID back to an earlier state. Same reasoning applies to calling the current functionality "undelete" for that matter -- it doesn't actually make the collection UUID come back, there's no attempt to restore name/metadata, etc. Would "recover-collection-data" be better, and make sense whether passing a log uuid, collection uuid, or manifest saved in a file? It's a bit unwieldy/verbose but I'm thinking that's ok for something only used in rare/unusual cases.

#16 Updated by Ward Vandewege 9 months ago

Tom Clegg wrote:

Should we also add "provide collection UUID -> get old version from most recent log entry concerning that collection" ...? That would probably be the most convenient way to do the most obvious case, i.e., undo the last change, whether or not it was a "delete".

Yes, that would be nice. Should we call that 'undo' instead of 'undelete' ?

I think "undo" isn't quite right because we're (intentionally) saving the recovered data elsewhere, rather than restoring the same collection UUID back to an earlier state. Same reasoning applies to calling the current functionality "undelete" for that matter -- it doesn't actually make the collection UUID come back, there's no attempt to restore name/metadata, etc. Would "recover-collection-data" be better, and make sense whether passing a log uuid, collection uuid, or manifest saved in a file? It's a bit unwieldy/verbose but I'm thinking that's ok for something only used in rare/unusual cases.

Yes, good call. You could shorten it to recover-collection if you want.

#17 Updated by Peter Amstutz 9 months ago

  • Related to Story #16514: Actionable insight into keep usage added

#18 Updated by Tom Clegg 9 months ago

16427-undelete @ e48478841828b1dbab8b69eb9453db23b42ed63f -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1913/
  • rename subcommand to "recover-collection"
  • accept collection UUID (equivalent to passing UUID of most recent log entry with matching object_uuid)

#19 Updated by Ward Vandewege 9 months ago

Tom Clegg wrote:

16427-undelete @ e48478841828b1dbab8b69eb9453db23b42ed63f -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1913/
  • rename subcommand to "recover-collection"
  • accept collection UUID (equivalent to passing UUID of most recent log entry with matching object_uuid)

LGTM thanks!

#20 Updated by Anonymous 9 months ago

  • Status changed from In Progress to Resolved

#21 Updated by Peter Amstutz 5 months ago

  • Release set to 25

Also available in: Atom PDF