Expiring collections » History » Version 15

Peter Amstutz, 09/28/2016 03:42 PM

h1. Expiring collections

h2. Overview

Deleting a collection is not an instantaneous operation. Rather, a collection can be set to expire at some future time. Until that time arrives, its data blocks are still considered valuable: a client can "recover from trash" by clearing the expiry flag.

This addresses (at least) three desirable features:

A client should be able to undo a "delete collection" operation that was done by a different client. For example, it should be possible to delete a collection using arv-mount, then recover it using Workbench.

Automated processes need temp/scratch space: a mechanism to protect data _temporarily_ from the garbage collector, without cluttering any user's account. Arvados should not require applications to do things like make "temp" subprojects and set timers to clean up old data.

It should not be possible to do a series of collection operations that results in "lost" blocks. Example:
# Get old collection A (with signed manifest)
# Delete old collection A
# (garbage collector runs now)
# Create new collection B (using the signed manifest from collection A)

h2. Background: existing race window

Keep's garbage collection strategy relies on a "race window": newly written data that is not yet referenced by any collection must not be deleted, because there is necessarily a time interval between getting a signature from a Keep server (by writing the data) and using that signature to add the block to a collection.

A timestamp signature from a keepstore server means "this data will not be deleted until the given timestamp": before giving out a signature, keepstore updates the mtime of the block on disk, and (even if asked by datamanager/keep-balance) refuses to delete blocks that are too new. This means the API server can safely store a collection without checking whether the referenced data blocks actually exist: if the timestamps are current, the blocks can't have been garbage-collected.

The expires_at behavior described here should help the API server offer a similar guarantee ("a signature expiring at time T means the data will not be deleted until T").
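The "too new to delete" rule can be illustrated with a small Python model. This is only a sketch under stated assumptions, not keepstore's actual code: the class @BlockStore@, its methods @sign@ and @try_delete@, and the @blob_signature_ttl@ parameter are hypothetical names chosen for this example.

```python
from datetime import datetime, timedelta


class BlockStore:
    """Toy model of one keepstore volume (hypothetical, for illustration only)."""

    def __init__(self, blob_signature_ttl: timedelta):
        self.blob_signature_ttl = blob_signature_ttl
        self.mtime = {}  # block hash -> time of the last write or "touch"

    def sign(self, block_hash: str, now: datetime) -> datetime:
        """Touch the block, then return the time until which it is protected."""
        self.mtime[block_hash] = now
        return now + self.blob_signature_ttl

    def try_delete(self, block_hash: str, now: datetime) -> bool:
        """Delete the block unless it is still inside the race window."""
        if now - self.mtime[block_hash] < self.blob_signature_ttl:
            return False  # too new: a valid signature may still be outstanding
        del self.mtime[block_hash]
        return True
```

In this model a delete request that arrives before the signature expires is refused, no matter who asks, which is what lets a client present the signature later without the server re-checking that the block exists.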

h2. Interpreting expires_at

Each collection has an expires_at field.

|expires_at                              |significance          |get (pdh)  |get (uuid) |appears in default list|appears in list with include_expired=true|can be modified|
|null                                    |persistent            |yes        |yes        |yes                    |yes                                      |yes            |
|now < expires_at                        |temporary/scratch     |yes&dagger;|yes&dagger;|yes&ddagger;           |yes                                      |yes            |
|expires_at <= now < expires_at+trashtime|trashed, recoverable  |no         |no         |no                     |yes                                      |only expires_at|
|now >= expires_at+trashtime             |trashed, unrecoverable|no         |no         |no                     |no                                       |no             |

Here trashtime denotes the period after expires_at during which a trashed collection can still be recovered.

&dagger; If expires_at is not null, any signatures given in a get/list response must expire before expires_at.

&ddagger; Clients (notably arv-mount and Workbench) will need updates to behave appropriately when *expiring* collections are present -- e.g., use expires_at filters when requesting collection lists, or show visual cues for transient collections. Tools like "arv-get" and "arv keep ls" should work as usual on expiring collections, although in interactive settings a warning message might be appropriate.

*Expired* collections are effectively deleted (whether/when the system deletes the rows from the underlying database table is an implementation detail).
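The rows of the table above can be read as a simple decision procedure. The following Python sketch is one possible reading, not the API server's implementation: the function names are hypothetical, @trashtime@ is passed in as a plain duration, and the signature rule treats "before expires_at" as "no later than expires_at", which is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def collection_state(expires_at: Optional[datetime], now: datetime,
                     trashtime: timedelta) -> str:
    """Return the "significance" column for a collection at time `now`."""
    if expires_at is None:
        return "persistent"
    if now < expires_at:
        return "temporary/scratch"
    if now < expires_at + trashtime:
        return "trashed, recoverable"
    return "trashed, unrecoverable"


def visible_in_list(state: str, include_expired: bool = False) -> bool:
    """Apply the two "appears in list" columns of the table."""
    if state in ("persistent", "temporary/scratch"):
        return True
    return include_expired and state == "trashed, recoverable"


def signature_expiry(now: datetime, signature_ttl: timedelta,
                     expires_at: Optional[datetime]) -> datetime:
    """Cap the signature lifetime at expires_at (assumed reading of the dagger note)."""
    expiry = now + signature_ttl
    if expires_at is not None:
        expiry = min(expiry, expires_at)
    return expiry
```

Keeping the classification in one function means the get, list, and update paths can all consult the same state instead of each comparing timestamps on its own.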

h2. Updating expires_at

When a user asks to delete a collection, by default the collection should not be deleted outright. Instead, its expires_at time should be set to @(now + defaultExpiryWindow)@ (or left alone, if it is already non-null and earlier than that default).

A default expiry window should be advertised in the discovery document.

A client can also set/clear expires_at explicitly in arvados.v1.collections.create or arvados.v1.collections.update. The given expires_at, if not null, can be any valid timestamp. If the client provides a timestamp in the past, the server should transparently change it to the current time: this will have the same effect as a time in the past, but will make more sense in the logs.

On an expiring collection, setting expires_at to null accomplishes "un-trash".

It is not possible to un-trash an expired collection: an update request returns 404.
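The two timestamp rules in this section (the default applied on delete, and the handling of timestamps in the past) can be sketched in Python as follows. The function names are hypothetical, and @default_expiry_window@ stands in for the value the discovery document would advertise.

```python
from datetime import datetime, timedelta
from typing import Optional


def expires_at_after_delete(current: Optional[datetime], now: datetime,
                            default_expiry_window: timedelta) -> datetime:
    """Compute expires_at when a user asks to delete a collection."""
    default = now + default_expiry_window
    if current is not None and current < default:
        return current  # already expiring sooner than the default: leave alone
    return default


def normalize_requested_expires_at(requested: Optional[datetime],
                                   now: datetime) -> Optional[datetime]:
    """Replace an explicit timestamp in the past with the current time."""
    if requested is not None and requested < now:
        return now
    return requested  # null (clear) and future timestamps pass through
```

Note that a delete request never extends the life of a collection that was already due to expire sooner.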

h2. Unique name index

After deleting a collection named "foo", it must be possible to create a new collection named "foo" in the same project without a name collision.

Two possible approaches:

# When expiring a collection, stash the original name somewhere and change its name to something unique (e.g., incorporating uuid and timestamp).
# Convert the database index to a partial index, so names only have to be unique among non-deleted items. (Disadvantage: arv-mount will not (always) be able to use the "name" field of an expiring collection as its filename in a trash directory.)

In any case, an application that _undeletes_ collections must be prepared to encounter name conflicts.
* It may help here to add the "ensure_unique_name" feature to the "update" method (currently it is only available in "create").
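A minimal Python sketch of the first approach, together with one way an undelete tool might resolve a name conflict. Both functions and the name formats they produce are hypothetical, invented for this example; in particular, the numeric suffix is not necessarily what "ensure_unique_name" generates.

```python
from datetime import datetime
from typing import Set


def trashed_name(name: str, uuid: str, now: datetime) -> str:
    """Build a unique name for an expiring collection from uuid and timestamp.

    `now` is assumed to be in UTC; the original name must be stored elsewhere.
    """
    return f"{name} (trashed {now.strftime('%Y%m%dT%H%M%SZ')} {uuid})"


def restore_name(original: str, names_in_project: Set[str]) -> str:
    """Pick a name for an undeleted collection, avoiding existing names."""
    if original not in names_in_project:
        return original
    n = 2
    while f"{original} ({n})" in names_in_project:
        n += 1
    return f"{original} ({n})"
```

For example, restoring "foo" into a project that already contains "foo" would yield "foo (2)" under this scheme.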

h2. Client behavior

Workbench should not normally display collections with @(expires_at is not null)@. A "view trash" feature would be useful, though.

arv-mount should not normally list collections with @(expires_at is not null)@. A "trash directory" feature would be useful, though.

datamanager/keep-balance must not delete data blocks that are referenced by any collection with @(expires_at is null or expires_at>now)@.
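The garbage collection rule can be expressed as a set computation. The Python sketch below is a simplification under stated assumptions: each collection is reduced to a hypothetical @(expires_at, block_hashes)@ pair, whereas a real implementation would have to parse block locators out of manifests.

```python
from datetime import datetime
from typing import Iterable, Optional, Set, Tuple

# Hypothetical simplified record: (expires_at, hashes of referenced blocks).
Collection = Tuple[Optional[datetime], Set[str]]


def protected_blocks(collections: Iterable[Collection],
                     now: datetime) -> Set[str]:
    """Return blocks referenced by any persistent or not-yet-expired collection."""
    keep: Set[str] = set()
    for expires_at, blocks in collections:
        if expires_at is None or expires_at > now:
            keep |= blocks
    return keep


def shown_by_default(expires_at: Optional[datetime]) -> bool:
    """Workbench/arv-mount rule: hide anything with expires_at set."""
    return expires_at is None
```

A block shared between an expired collection and a persistent one stays protected, because the union is taken over every collection that still qualifies.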

h2. Collection modifications vs. consistency

In order to guarantee "permission signature timestamp T == no garbage collection until T", garbage collection must take into account blocks that were _recently_ referenced by collections.

(This guarantee is fundamentally at odds with an important admin feature, [[Expedited delete]]: an admin should have a mechanism to accelerate garbage collection. Ideally, this action can be restricted to the blocks from a specific deleted collection.)

Datamanager/keep-balance can use arvados.v1.logs.index to get older versions of each manifest that has been changed or deleted recently (<= blobSignatureTTL seconds ago).

In order to accomplish "expedited delete" (without backdating or deleting log table entries, which would confuse other uses of event logs) the admin tool will need to do a focused garbage collection operation itself: it won't be enough to expire/delete the collection record right away. The most powerful/immediate variations of "expedited delete" will need to work this way anyway, though, in order to bypass the usual "do not delete blocks newer than permission TTL" restriction for a specific set of affected blocks.
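As a rough Python sketch of the "recently referenced" rule, assume each relevant log entry has already been reduced to a hypothetical @(event_at, old_block_hashes)@ pair, where the hashes come from the manifest as it was before the change or deletion. Fetching entries through arvados.v1.logs.index and parsing old manifests are outside the scope of this example.

```python
from datetime import datetime, timedelta
from typing import Iterable, Set, Tuple

# Hypothetical simplified log record: (time of change, blocks in the old manifest).
LogEntry = Tuple[datetime, Set[str]]


def recently_referenced_blocks(log_entries: Iterable[LogEntry], now: datetime,
                               blob_signature_ttl: timedelta) -> Set[str]:
    """Return blocks referenced by manifests changed or deleted within the TTL."""
    cutoff = now - blob_signature_ttl
    keep: Set[str] = set()
    for event_at, old_blocks in log_entries:
        if event_at >= cutoff:  # i.e. at most blob_signature_ttl ago
            keep |= old_blocks
    return keep
```

The garbage collector would add this set to the blocks protected by current collections before deciding what may be deleted.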

h2. Related: replication_desired=0

A collection with replication_desired=0 does not protect its data from garbage collection. In this sense, replication_desired=0 is similar to expires_at<now.

However, replication_desired=0 does not mean the collection record itself should be hidden. It means the collection metadata (filenames, sizes, data hashes, collection PDH) are valuable enough to keep on hand, but the data itself isn't. For example, if we delete intermediate data generated by a workflow, and find later that the same workflow now produces a different result, it would be helpful to see which of the intermediate outputs differed.

h2. TBD

When deleting a project that contains expiring or persistent collections, presumably the persistent collections should become expiring collections, but what should their new owner_uuid be?
* Proposed solution: projects themselves also need an expires_at field that works the same way.