Project

General

Profile

S3 bucket volume implementation » History » Version 9

Tom Clegg, 07/14/2016 07:34 PM

1 1 Tom Clegg
h1. S3 bucket volume implementation
2
3
{{toc}}
4
5
h2. Background - unsafe delete
6
7
With the current (2016-07-04) S3 volume implementation, an S3 volume can store blocks safely, but cannot do garbage collection safely.
8
9
There is a command line flag @-s3-unsafe-delete@ that enables unsafe garbage collection. With this setting, it is possible to lose data: if a client writes a new copy of a previously stored block that is now scheduled for deletion, the newly written copy can be deleted even though the client is told the data is safely stored.
10
11
h2. S3 API challenge
12
13
S3 offers eventual consistency. Specifically:
14
* Reading an object doesn't necessarily return the latest version: "write X=foo; write X=bar; read X" might return foo.
15
* Deleting is not immediate: "write X=foo; delete X; read X" might return foo.
16
* There is no atomic conditional write operation (e.g., "overwrite X only if X has etag=foo").
17
18
This makes it hard to do garbage collection. How can we delete X if we don't know whether X has been written recently?
19
20 4 Tom Clegg
h2. Proposed approach
21 1 Tom Clegg
22 4 Tom Clegg
The state of a block is represented by the existence and timestamps of three objects:
23 1 Tom Clegg
* Data object, name="{md5}" -- contains the stored data; timestamp and metadata are irrelevant.
24
* Mtime object, name="recent/{md5}" -- zero data bytes; Last-Modified reflects most recent Put or Touch.
25
* Trash object, name="trash/{md5}" -- contains a copy of the stored data; Last-Modified reflects most recent Trash.
26 4 Tom Clegg
27
The strategy is:
28
# Tolerate, and automatically clean up, untidy state caused by races.
29
# Avoid entering races where there's a possibility of losing data.
30
31 5 Tom Clegg
Specifically, there is a window, just before a trashed block becomes eligible for final deletion, when it is impossible to safely trash a newer copy of that block:
32
* Racing EmptyTrash ("delete trash/X if it's old enough") against Trash ("copy X to trash/X, then delete X") can result in both copies (X and trash/X) being deleted: EmptyTrash inspects state and issues a "delete trash/X" request; then Trash issues "copy X to trash/X" and "delete X" requests; then the S3 service performs those two operations in the reverse order, and both copies are deleted.
33
* Meanwhile, if Put(X) was also racing, we would discover next time we call FixRace() that X shouldn't have been trashed at all, and we should rescue it from the trash ... but that would be impossible.
34
35 4 Tom Clegg
|X|recent/X|trash/X|State|
36
|✓|-|-|X is not trash. It was written by an old version, or there was a race between EmptyTrash and Put. To fix, create a new recent/X.|
37
|✓|new|-|X is not trash, and is new. If it appears in a trash list, do nothing.|
38
|✓|old|-|X is not trash, and is old. If it appears in a trash list, trash it.|
39
|✓|any|✓|X is not trash. There was a race between Trash and Put, or Trash didn't (yet) finish.|
40 6 Tom Clegg
|-|old|new*|X is trash, not yet eligible for deletion. *newer than (now - trashLifetime + raceWindow)|
41
|-|old|racy*|X is trash, not yet eligible for deletion, and prone to races with Trash. *older than (now - trashLifetime + raceWindow)|
42
|-|old|old*|X is trash, eligible for deletion. *older than (now - trashLifetime)|
43 4 Tom Clegg
|-|new|✓|X is not trash. There was a race between Trash and Put.|
44
|-|-|✓|Impossible? (Race between Trash, EmptyTrash, and Put?) To fix, touch recent/X and copy trash/X to X.|
45
46
h2. Implementation draft
47 1 Tom Clegg
48
Put(X):
49
# Write data to data object, "X"
50
# Write zero bytes to mtime object, "recent/X"
51
52
Touch(X):
53
# Write zero bytes to mtime object, "recent/X"
54
# Ensure data object "X" exists. If not:
55
#* If FixRace(X) returns false, return ErrNotExist
56
#* Else return nil (OK)
57
#* (Handlers only use Touch after Get succeeds, so a 404 here is almost certainly an early opportunity to catch and correct a touch-vs.-trash race.)
58
59
Mtime(X):
60
# Ensure data object "X" exists. If not, return ErrNotExist.
61
# Ensure "recent/X" exists. If not, create it.
62
# Return LastModified("recent/X")
63
64
Trash(X):
65
# If trashLifetime==0 && !unsafeDelete: "not implemented" error
66 7 Tom Clegg
# If "trash/X" exists and LastModified("trash/X") + trashLifetime < now + raceWindow, then it is not safe to delete "X", so return without doing anything.
67 1 Tom Clegg
# Get "recent/X"
68
# PutCopy("X", "trash/X")
69
# Del("X")
70
71
Get(X):
72
# If Get("X") succeeds, return content
73
# Else if Get returned a 404, and FixRace(X) returns true, call Get("X") again and return content
74
75
FixRace(X): "check whether X needs to be cleaned up after a race; if so, clean it up. Return true if X was untrashed during cleanup."
76
# Use HEAD to get LastModified("trash/X"). If "trash/X" does not exist, return false.
77 8 Tom Clegg
# Use HEAD to get LastModified("recent/X").
78 1 Tom Clegg
# If LastModified("trash/X") < LastModified("recent/X") + permissionTTL, a touch/trash race went wrong, so we need to fix it:
79
#* Perform HEAD requests for "X" and "trash/X"
80 7 Tom Clegg
#* PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time)
81 1 Tom Clegg
#* Log this event.
82
#* Return true.
83
# Return false.
84
85
Untrash(X):
86
# PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time)
87
88
EmptyTrash():
89 9 Tom Clegg
# List all trash objects. For each X...
90
# Look up corresponding mtime object ("recent/X")
91 1 Tom Clegg
# If LastModified("trash/X") < LastModified("recent/X") + permissionTTL: call FixRace(X)
92
# Else if now() > LastModified("trash/X") + trashLifetime, it's time to really delete this object:
93
#* Del("trash/X")
94 2 Tom Clegg
#* Del("recent/X") unless "X" exists
95 1 Tom Clegg
96
Index():
97
# List all data objects and mtime objects ("X" and "recent/X"), using a merge sort to get them in matching pairs.
98
# Include X in the index only if "X" *and* "recent/X" exist.
99
# In the index response row "hash+size timestamp": hash is X, size is Size("X"), timestamp is LastModified("recent/X")
100
101 2 Tom Clegg
Sometimes, races will cause valuable blocks to be trashed, but they will be recovered automatically by FixRace(), provided:
102
* the S3 service achieves eventual consistency faster than raceWindow, and
103
* raceWindow < trashLifetime
104 1 Tom Clegg
105
These "accidentally trashed" blocks will be omitted from the index, which means keep-balance will occasionally report them as "lost". This is undesirable, because never cry wolf.
106
107
h2. Unresolved issues
108
109 3 Tom Clegg
Is there a deadline for eventual consistency? IOW, is there a value raceWindow can be set to?