Project

General

Profile

S3 bucket volume implementation » History » Version 4

Tom Clegg, 07/06/2016 01:24 PM

1 1 Tom Clegg
h1. S3 bucket volume implementation
2
3
{{toc}}
4
5
h2. Background - unsafe delete
6
7
With the current (2016-07-04) S3 volume implementation, an S3 volume can store blocks safely, but cannot do garbage collection safely.
8
9
There is a command line flag @-s3-unsafe-delete@ that enables unsafe garbage collection. With this setting, it is possible to lose data: if a client writes a new copy of a previously stored block that is now scheduled for deletion, the newly written copy can be deleted even though the client is told the data is safely stored.
10
11
h2. S3 API challenge
12
13
S3 offers eventual consistency. Specifically:
14
* Reading an object doesn't necessarily return the latest version: "write X=foo; write X=bar; read X" might return foo.
15
* Deleting is not immediate: "write X=foo; delete X; read X" might return foo.
16
* There is no atomic conditional write operation (e.g., "overwrite X only if X has etag=foo").
17
18
This makes it hard to do garbage collection. How can we delete X if we don't know whether X has been written recently?
19
20 4 Tom Clegg
h2. Proposed approach
21 1 Tom Clegg
22 4 Tom Clegg
The state of a block is represented by the existence and timestamps of three objects:
23 1 Tom Clegg
* Data object, name="{md5}" -- contains the stored data; timestamp and metadata are irrelevant.
24
* Mtime object, name="recent/{md5}" -- zero data bytes; Last-Modified reflects most recent Put or Touch.
25
* Trash object, name="trash/{md5}" -- contains a copy of the stored data; Last-Modified reflects most recent Trash.
26 4 Tom Clegg
27
The strategy is:
28
# Tolerate, and automatically clean up, untidy state caused by races.
29
# Avoid entering races where there's a possibility of losing data.
30
31
|X|recent/X|trash/X|State|
32
|✓|-|-|X is not trash. It was written by an old version, or there was a race between EmptyTrash and Put. To fix, create a new recent/X.|
33
|✓|new|-|X is not trash, and is new. If it appears in a trash list, do nothing.|
34
|✓|old|-|X is not trash, and is old. If it appears in a trash list, trash it.|
35
|✓|any|✓|X is not trash. There was a race between Trash and Put, or Trash didn't (yet) finish.|
36
|-|old|✓|X is trash.|
37
|-|new|✓|X is not trash. There was a race between Trash and Put.|
38
|-|-|✓|Impossible? (Race between Trash, EmptyTrash, and Put?) To fix, touch recent/X and copy trash/X to X.|
39
40
h2. Implementation draft
41 1 Tom Clegg
42
Put(X):
43
# Write data to data object, "X"
44
# Write zero bytes to mtime object, "recent/X"
45
46
Touch(X):
47
# Write zero bytes to mtime object, "recent/X"
48
# Ensure data object "X" exists. If not:
49
#* If FixRace(X) returns false, return ErrNotExist
50
#* Else return nil (OK)
51
#* (Handlers only use Touch after Get succeeds, so a 404 here is almost certainly an early opportunity to catch and correct a touch-vs.-trash race.)
52
53
Mtime(X):
54
# Ensure data object "X" exists. If not, return ErrNotExist.
55
# Ensure "recent/X" exists. If not, create it.
56
# Return LastModified("recent/X")
57
58
Trash(X):
59
# If trashLifetime==0 && !unsafeDelete: "not implemented" error
60 2 Tom Clegg
# If "trash/X" exists and LastModified("trash/X") + trashLifetime > now + raceWindow, then it is not safe to delete "X", so return without doing anything.
61 1 Tom Clegg
# Get "recent/X"
62
# PutCopy("X", "trash/X")
63
# Del("X")
64
65
Get(X):
66
# If Get("X") succeeds, return content
67
# Else if Get returned a 404, and FixRace(X) returns true, call Get("X") again and return content
68
69
FixRace(X): "check whether X needs to be cleaned up after a race; if so, clean it up. Return true if X was untrashed during cleanup."
70
# Use HEAD to get LastModified("recent/X").
71
# If "recent/X" does not exist, an emptyTrash/Trash race went wrong, so we need to fix it by saving the very conservative value "now" as "most recent Put or Touch":
72
#* Put("recent/X", zeroBytes)
73
#* Log this event.
74
# Use HEAD to get LastModified("trash/X"). If "trash/X" does not exist, return false.
75
# If LastModified("trash/X") < LastModified("recent/X") + permissionTTL, a touch/trash race went wrong, so we need to fix it:
76
#* Perform HEAD requests for "X" and "trash/X"
77
#* If "X" exists but etag differs from "trash/X", log a collision warning and do nothing
78
#* If "X" exists and etag matches, delete "trash/X"
79
#* If "X" does not exist, PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time)
80
#* Log this event.
81
#* Return true.
82
# Return false.
83
84
Untrash(X):
85
# PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time)
86
87
EmptyTrash():
88
# List all trash objects and mtime objects ("trash/X" and "recent/X"), using a merge sort to get them in matching pairs. For each X...
89
# If LastModified("trash/X") < LastModified("recent/X") + permissionTTL: call FixRace(X)
90
# Else if now() > LastModified("trash/X") + trashLifetime, it's time to really delete this object:
91
#* Del("trash/X")
92 2 Tom Clegg
#* Del("recent/X") unless "X" exists
93 1 Tom Clegg
94
Index():
95
# List all data objects and mtime objects ("X" and "recent/X"), using a merge sort to get them in matching pairs.
96
# Include X in the index only if "X" *and* "recent/X" exist.
97
# In the index response row "hash+size timestamp": hash is X, size is Size("X"), timestamp is LastModified("recent/X")
98
99 2 Tom Clegg
Sometimes, races will cause valuable blocks to be trashed, but they will be recovered automatically by FixRace(), provided:
100
* the S3 service achieves eventual consistency faster than raceWindow, and
101
* raceWindow < trashLifetime
102 1 Tom Clegg
103
These "accidentally trashed" blocks will be omitted from the index, which means keep-balance will occasionally report them as "lost". This is undesirable, because never cry wolf.
104
105
h2. Unresolved issues
106
107 3 Tom Clegg
Is there a deadline for eventual consistency? IOW, is there a value raceWindow can be set to?