Expiring collections » History » Version 21

Tom Clegg, 12/23/2016 09:18 PM

h1. Expiring collections

{{toc}}

h2. Overview

In addition to the two obvious states ("preserved indefinitely" and "irreversibly deleted") Arvados can offer some more subtle persistence states for collections:
# An *expiring* collection (aka temporary, transient, scratch) has an expiry ("trash_at") time in the future, at which time it automatically moves to the trash
# A *trashed* collection is not visible or readable through normal data access APIs, but (until its "delete_at" time is reached) can be un-trashed by users

h2. Significance of trash_at and delete_at

Each collection has a trash_at field and a delete_at field.

|trash_at  |delete_at  |get (pdh or uuid) |get?include_trash=true |list        |list?include_trash=true |can be modified             |
|null      |null       |yes               |yes                    |yes         |yes                     |yes                         |
|future    |future     |yes†              |yes†                   |yes†        |yes†                    |yes                         |
|past      |future     |no                |yes‡                   |no          |yes                     |only trash_at and delete_at |
|past      |past       |no                |no                     |no          |no                      |no                          |

† If trash_at is not null, any signatures given in a get/list response must expire before trash_at.

† Clients (notably arv-mount and Workbench) will need updates to behave appropriately when collections have a "trash" timer set -- e.g., use trash_at filters when requesting collection lists, or show visual cues for transient collections. Tools like "arv-get" and "arv keep ls" should work as usual on transient collections, although in interactive settings a warning message might be appropriate.

‡ No signatures should be given in get/list responses.

"Trashed, unrecoverable" collections (trash_at and delete_at both in the past) are effectively deleted. Whether/when the system deletes the rows from the underlying database table is an implementation detail invisible to clients.
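
The table above can be read as a small decision function. The following is a minimal sketch in Python, not the API server's implementation: the names @Access@ and @access_for@ are hypothetical, and it assumes a timestamp equal to the current time already counts as "past", which the table does not specify.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple, Optional


class Access(NamedTuple):
    """One row of the table above (hypothetical type, for illustration)."""
    get: bool                   # get (pdh or uuid)
    get_include_trash: bool     # get?include_trash=true
    listed: bool                # list
    listed_include_trash: bool  # list?include_trash=true
    modifiable: str             # "yes", "trash fields only", or "no"


def access_for(trash_at: Optional[datetime],
               delete_at: Optional[datetime],
               now: datetime) -> Access:
    """Map (trash_at, delete_at) to the visibility row described in the table."""
    if trash_at is None and delete_at is None:
        # Persistent collection: preserved indefinitely.
        return Access(True, True, True, True, "yes")
    if trash_at is None or delete_at is None:
        raise ValueError("trash_at and delete_at must both be null or both be set")
    if now < trash_at:
        # Expiring: still fully visible; signatures must expire before trash_at.
        return Access(True, True, True, True, "yes")
    if now < delete_at:
        # Trashed: visible only with include_trash, and only un-trashable.
        return Access(False, True, False, True, "trash fields only")
    # Trashed, unrecoverable: effectively deleted.
    return Access(False, False, False, False, "no")


now = datetime(2017, 1, 1, tzinfo=timezone.utc)
hour = timedelta(hours=1)
print(access_for(now - hour, now + hour, now).modifiable)  # trash fields only
```

The sketch deliberately leaves signature handling (the † and ‡ notes) out of the returned row, since those notes constrain the response contents rather than visibility.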

h2. Updating trash_at and delete_at

Values of trash_at and delete_at are constrained:
* If one is null, the other must be null too.
* 0 <= (delete_at - trash_at) <= api_config.max_trash_time

The arvados.v1.collections.delete API should set trash_at to @now@ instead of deleting the collection outright.

A client can also explicitly set or clear trash_at in arvados.v1.collections.create or arvados.v1.collections.update. The given trash_at, if not null, can be any valid timestamp. If the client provides a timestamp in the past, the server should transparently change it to the current time: this makes more sense in the logs, and ensures un-trash remains possible for the duration indicated by the site-wide trash time.

On a trashed (expired) collection, setting trash_at and delete_at to null (or to a future time) accomplishes "un-trash".

It is not possible to un-trash (or modify in any other way) a collection whose delete_at time has passed: an update request returns 404.
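
The constraints and the "past timestamp becomes now" rule can be sketched as one server-side normalization step. This is a minimal illustration under stated assumptions, not the actual API server code: the function name @normalize_trash_times@ is hypothetical, @max_trash_time@ stands in for api_config.max_trash_time, and the sketch assumes the clamp is applied before the range check.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def normalize_trash_times(
    trash_at: Optional[datetime],
    delete_at: Optional[datetime],
    now: datetime,
    max_trash_time: timedelta,
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Validate a requested (trash_at, delete_at) pair and return the stored values."""
    if trash_at is None and delete_at is None:
        # Clearing both fields: persistent (or un-trashed) collection.
        return None, None
    if trash_at is None or delete_at is None:
        raise ValueError("if one of trash_at/delete_at is null, the other must be null too")
    if trash_at < now:
        # A trash_at in the past is transparently changed to the current time.
        trash_at = now
    gap = delete_at - trash_at
    if not (timedelta(0) <= gap <= max_trash_time):
        raise ValueError("require 0 <= (delete_at - trash_at) <= max_trash_time")
    return trash_at, delete_at


now = datetime(2017, 1, 1, tzinfo=timezone.utc)
day = timedelta(days=1)
# A backdated trash_at is stored as "now"; delete_at is kept as requested.
stored = normalize_trash_times(now - day, now + day, now, max_trash_time=14 * day)
print(stored[0] == now, stored[1] == now + day)  # True True
```

The 404 response for collections whose delete_at has passed is not modeled here, because it depends on the stored record rather than on the requested values.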

h2. Unique name index

After trashing a collection named "foo", it must be possible to create a new collection named "foo" in the same project without a name collision.

Two possible approaches:

# When expiring a collection, stash the original name somewhere and change its name to something unique (e.g., incorporating uuid and timestamp).
# Convert the database index to a partial index, so names only have to be unique among non-trashed items. (Disadvantage: arv-mount will not always be able to use the "name" field of a trashed collection as its filename in a trash directory.)

In any case, an application that _undeletes_ collections must be prepared to encounter name conflicts.
* It may help here to add the "ensure_unique_name" feature to the "update" method (currently it is only available in "create").
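
The first approach (renaming on trash) and the name-conflict handling on un-trash could look roughly like the sketch below. Everything in it is hypothetical: the naming scheme, the helper names @trashed_name@ and @restored_name@, the example UUID, and the numeric suffix format, which is not claimed to match what "ensure_unique_name" produces.

```python
from datetime import datetime, timezone
from typing import Set


def trashed_name(name: str, uuid: str, trashed_at: datetime) -> str:
    """Build a unique placeholder name from the uuid and trash timestamp.

    The original name must be stashed elsewhere by the caller so it can be
    restored on un-trash.
    """
    stamp = trashed_at.strftime("%Y%m%dT%H%M%SZ")
    return f"{name} (trashed {uuid} {stamp})"


def restored_name(original: str, names_in_project: Set[str]) -> str:
    """Pick a name for an un-trashed collection, avoiding existing names."""
    if original not in names_in_project:
        return original
    n = 2
    while f"{original} ({n})" in names_in_project:
        n += 1
    return f"{original} ({n})"


when = datetime(2016, 12, 23, 21, 18, 0, tzinfo=timezone.utc)
print(trashed_name("foo", "zzzzz-4zz18-0123456789abcde", when))
# foo (trashed zzzzz-4zz18-0123456789abcde 20161223T211800Z)
print(restored_name("foo", {"foo", "bar"}))  # foo (2)
```

Because the uuid is unique per collection, the placeholder name cannot collide with another trashed collection's placeholder, which is the property the unique index needs.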

h2. User interface considerations

Workbench should indicate the difference between transient and permanent collections (e.g., make a visual distinction between null and non-null trash_at).

Workbench and arv-mount should provide a way to find and recover trashed collections.

h2. Garbage collection (keep-balance) considerations

It should not be possible to do a series of collection operations that results in "lost" blocks. Example:
# Get old collection A (with signed manifest)
# Delete old collection A
# (garbage collector runs now)
# Create new collection B (using the signed manifest from collection A)

h3. Background: race window

Keep's garbage collection strategy relies on a "race window": new unreferenced data must not be deleted, because there is necessarily a time interval between getting a signature from a Keep server (by writing the data) and using that signature to add the block to a collection.

A timestamp signature from a keepstore server means "this data will not be deleted until the given timestamp": before giving out a signature, keepstore updates the mtime of the block on disk, and (even if asked by datamanager/keep-balance) refuses to delete blocks that are too new. This means the API server can safely store a collection without checking whether the referenced data blocks actually exist: if the timestamps are current, the blocks can't have been garbage-collected.

The trash_at/delete_at behavior described here should help the API server offer a similar guarantee ("a signature expiring at time T means the data will not be deleted until T").
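
One way to picture the guarantee: when the API server signs a manifest, it caps the signature expiry at the collection's trash_at, and gives no signature at all for a trashed collection. The sketch below is an illustration only; @signature_expiry@ is a hypothetical helper, @blob_signature_ttl@ stands in for the site's signature TTL, and "must expire before trash_at" is interpreted here as "no later than trash_at".

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def signature_expiry(now: datetime,
                     blob_signature_ttl: timedelta,
                     trash_at: Optional[datetime]) -> Optional[datetime]:
    """Return the expiry time to sign with, or None if no signature may be given."""
    expiry = now + blob_signature_ttl
    if trash_at is None:
        # Persistent collection: the usual TTL applies.
        return expiry
    if trash_at <= now:
        # Trashed collection: get/list responses carry no signatures.
        return None
    # Expiring collection: never promise data availability past trash_at.
    return min(expiry, trash_at)


now = datetime(2017, 1, 1, tzinfo=timezone.utc)
ttl = timedelta(days=14)
print(signature_expiry(now, ttl, now + timedelta(days=1)) == now + timedelta(days=1))  # True
```

With this cap, any signature a client holds expires no later than the moment the collection stops protecting its blocks, so the example sequence above cannot produce a collection that references garbage-collected data.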

h3. Collection modifications vs. consistency

(TODO: update to reflect above definitions of trash_at and delete_at)

In order to guarantee "permission signature timestamp T == no garbage collection until T", garbage collection must take into account blocks that were _recently_ referenced by collections.

> Aside: This guarantee is fundamentally at odds with an important admin feature, [[Expedited delete]]: an admin should have a mechanism to accelerate garbage collection. Ideally, this action can be restricted to the blocks from a specific deleted collection.

Datamanager/keep-balance can use arvados.v1.logs.index to get older versions of each manifest that has been changed or deleted recently (<= blobSignatureTTL seconds ago).

In order to accomplish "expedited delete" (without backdating or deleting log table entries, which would confuse other uses of event logs) the admin tool will need to do a focused garbage collection operation itself: it won't be enough to expire/delete the collection record right away. The most powerful/immediate variations of "expedited delete" will need to work this way anyway, though, in order to bypass the usual "do not delete blocks newer than permission TTL" restriction for a specific set of affected blocks.
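
The log-based approach described above amounts to protecting the union of currently referenced blocks and blocks referenced by manifests that changed within the signature TTL. A minimal sketch follows; the function name @protected_blocks@ and the input shapes (sets of block hashes, and (changed_at, old_blocks) pairs) are assumptions made for illustration, and the sketch does not call arvados.v1.logs.index or parse real manifests.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Set, Tuple


def protected_blocks(
    current_manifests: Iterable[Set[str]],
    logged_changes: Iterable[Tuple[datetime, Set[str]]],
    now: datetime,
    blob_signature_ttl: timedelta,
) -> Set[str]:
    """Return the block hashes that garbage collection must not delete."""
    keep: Set[str] = set()
    # Blocks referenced by collections as they exist right now.
    for blocks in current_manifests:
        keep |= blocks
    # Blocks referenced by older manifest versions that were changed or
    # deleted recently enough that a client may still hold a valid signature.
    cutoff = now - blob_signature_ttl
    for changed_at, old_blocks in logged_changes:
        if changed_at >= cutoff:
            keep |= old_blocks
    return keep


now = datetime(2017, 1, 1, tzinfo=timezone.utc)
ttl = timedelta(days=14)
keep = protected_blocks(
    current_manifests=[{"aaa", "bbb"}],
    logged_changes=[
        (now - timedelta(days=1), {"ccc"}),   # recent: still protected
        (now - timedelta(days=30), {"ddd"}),  # older than the TTL: not protected
    ],
    now=now,
    blob_signature_ttl=ttl,
)
print(sorted(keep))  # ['aaa', 'bbb', 'ccc']
```

An "expedited delete" tool would have to bypass exactly the second loop for the affected blocks, which is why it cannot be implemented by trashing the collection record alone.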

h2. Related: replication_desired=0

A collection with replication_desired=0 does not protect its data from garbage collection. In this sense, replication_desired=0 is similar to now>delete_at.

However, replication_desired=0 does not mean the collection record itself should be hidden. It means the collection metadata (filenames, sizes, data hashes, collection PDH) are valuable enough to keep on hand, but the data itself isn't. For example, if we delete intermediate data generated by a workflow, and find later that the same workflow now produces a different result, it would be helpful to see which of the intermediate outputs differed.

h2. TBD

When deleting a project that contains expiring or persistent collections, presumably the persistent collections should be trashed, but what should their new owner_uuid be?
* Proposed solution: projects themselves also need a trash_at field that works the same way.