Project

General

Profile

S3 bucket volume implementation » History » Version 1

Tom Clegg, 07/04/2016 02:23 PM

1 1 Tom Clegg
h1. S3 bucket volume implementation
2
3
{{toc}}
4
5
h2. Background - unsafe delete
6
7
With the current (2016-07-04) S3 volume implementation, an S3 volume can store blocks safely, but cannot do garbage collection safely.
8
9
There is a command line flag @-s3-unsafe-delete@ that enables unsafe garbage collection. With this setting, it is possible to lose data: if a client writes a new copy of a previously stored block that is now scheduled for deletion, the newly written copy can be deleted even though the client is told the data is safely stored.
10
11
h2. S3 API challenge
12
13
S3 offers eventual consistency. Specifically:
14
* Reading an object doesn't necessarily return the latest version: "write X=foo; write X=bar; read X" might return foo.
15
* Deleting is not immediate: "write X=foo; delete X; read X" might return foo.
16
* There is no atomic conditional write operation (e.g., "overwrite X only if X has etag=foo").
17
18
This makes it hard to do garbage collection. How can we delete X if we don't know whether X has been written recently?
19
20
h2. Implementation draft
21
22
Proposed approach: Each block is represented by (up to) three S3 objects.
23
* Data object, name="{md5}" -- contains the stored data; timestamp and metadata are irrelevant.
24
* Mtime object, name="recent/{md5}" -- zero data bytes; Last-Modified reflects most recent Put or Touch.
25
* Trash object, name="trash/{md5}" -- contains a copy of the stored data; Last-Modified reflects most recent Trash.
26
27
Put(X):
28
# Write data to data object, "X"
29
# Write zero bytes to mtime object, "recent/X"
30
31
Touch(X):
32
# Write zero bytes to mtime object, "recent/X"
33
# Ensure data object "X" exists. If not:
34
#* If FixRace(X) returns false, return ErrNotExist
35
#* Else return nil (OK)
36
#* (Handlers only use Touch after Get succeeds, so a 404 here is almost certainly an early opportunity to catch and correct a touch-vs.-trash race.)
37
38
Mtime(X):
39
# Ensure data object "X" exists. If not, return ErrNotExist.
40
# Ensure "recent/X" exists. If not, create it.
41
# Return LastModified("recent/X")
42
43
Trash(X):
44
# If trashLifetime==0 && !unsafeDelete: "not implemented" error
45
# Get "recent/X"
46
# PutCopy("X", "trash/X")
47
# Del("X")
48
49
Get(X):
50
# If Get("X") succeeds, return content
51
# Else if Get returned a 404, and FixRace(X) returns true, call Get("X") again and return content
52
53
FixRace(X): "check whether X needs to be cleaned up after a race; if so, clean it up. Return true if X was untrashed during cleanup."
54
# Use HEAD to get LastModified("recent/X").
55
# If "recent/X" does not exist, an emptyTrash/Trash race went wrong, so we need to fix it by saving the very conservative value "now" as "most recent Put or Touch":
56
#* Put("recent/X", zeroBytes)
57
#* Log this event.
58
# Use HEAD to get LastModified("trash/X"). If "trash/X" does not exist, return false.
59
# If LastModified("trash/X") < LastModified("recent/X") + permissionTTL, a touch/trash race went wrong, so we need to fix it:
60
#* Perform HEAD requests for "X" and "trash/X"
61
#* If "X" exists but etag differs from "trash/X", log a collision warning and do nothing
62
#* If "X" exists and etag matches, delete "trash/X"
63
#* If "X" does not exist, PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time)
64
#* Log this event.
65
#* Return true.
66
# Return false.
67
68
Untrash(X):
69
# PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time)
70
71
EmptyTrash():
72
# List all trash objects and mtime objects ("trash/X" and "recent/X"), using a merge sort to get them in matching pairs. For each X...
73
# If LastModified("trash/X") < LastModified("recent/X") + permissionTTL: call FixRace(X)
74
# Else if now() > LastModified("trash/X") + trashLifetime, it's time to really delete this object:
75
#* Del("trash/X")
76
#* Del("recent/X")
77
78
Index():
79
# List all data objects and mtime objects ("X" and "recent/X"), using a merge sort to get them in matching pairs.
80
# Include X in the index only if "X" *and* "recent/X" exist.
81
# In the index response row "hash+size timestamp": hash is X, size is Size("X"), timestamp is LastModified("recent/X")
82
83
Sometimes, races will cause valuable blocks to be trashed, but they will be recovered automatically by FixRace(), _provided the S3 service achieves eventual consistency faster than trashLifetime._
84
85
These "accidentally trashed" blocks will be omitted from the index, which means keep-balance will occasionally report them as "lost". This is undesirable, because never cry wolf.
86
87
h2. Unresolved issues
88
89
If both X and trash/X exist (e.g., a client has written a copy of a block that was already in trash), a race between Trash and FixRace can delete both X and trash/X (see #8555#note-21).