S3 bucket volume implementation » History » Version 7
Tom Clegg, 07/08/2016 07:44 PM
1 | 1 | Tom Clegg | h1. S3 bucket volume implementation |
---|---|---|---|
2 | |||
3 | {{toc}} |
||
4 | |||
5 | h2. Background - unsafe delete |
||
6 | |||
7 | With the current (2016-07-04) S3 volume implementation, an S3 volume can store blocks safely, but cannot do garbage collection safely. |
||
8 | |||
9 | There is a command line flag @-s3-unsafe-delete@ that enables unsafe garbage collection. With this setting, it is possible to lose data: if a client writes a new copy of a previously stored block that is now scheduled for deletion, the newly written copy can be deleted even though the client is told the data is safely stored. |
||
10 | |||
11 | h2. S3 API challenge |
||
12 | |||
13 | S3 offers eventual consistency. Specifically: |
||
14 | * Reading an object doesn't necessarily return the latest version: "write X=foo; write X=bar; read X" might return foo. |
||
15 | * Deleting is not immediate: "write X=foo; delete X; read X" might return foo. |
||
16 | * There is no atomic conditional write operation (e.g., "overwrite X only if X has etag=foo"). |
||
17 | |||
18 | This makes it hard to do garbage collection. How can we delete X if we don't know whether X has been written recently? |
||
19 | |||
20 | 4 | Tom Clegg | h2. Proposed approach |
21 | 1 | Tom Clegg | |
22 | 4 | Tom Clegg | The state of a block is represented by the existence and timestamps of three objects: |
23 | 1 | Tom Clegg | * Data object, name="{md5}" -- contains the stored data; timestamp and metadata are irrelevant. |
24 | * Mtime object, name="recent/{md5}" -- zero data bytes; Last-Modified reflects most recent Put or Touch. |
||
25 | * Trash object, name="trash/{md5}" -- contains a copy of the stored data; Last-Modified reflects most recent Trash. |
||
26 | 4 | Tom Clegg | |
27 | The strategy is: |
||
28 | # Tolerate, and automatically clean up, untidy state caused by races. |
||
29 | # Avoid entering races where there's a possibility of losing data. |
||
30 | |||
31 | 5 | Tom Clegg | Specifically, there is a window, just before a trashed block becomes eligible for final deletion, when it is impossible to safely trash a newer copy of that block: |
32 | * Racing EmptyTrash ("delete trash/X if it's old enough") against Trash ("copy X to trash/X, then delete X") can result in both copies (X and trash/X) being deleted: EmptyTrash inspects state and issues a "delete trash/X" request; then Trash issues "copy X to trash/X" and "delete X" requests; then the S3 service performs those two operations in the reverse order, and both copies are deleted. |
||
33 | * Meanwhile, if Put(X) was also racing, we would discover next time we call FixRace() that X shouldn't have been trashed at all, and we should rescue it from the trash ... but that would be impossible. |
||
34 | |||
35 | 4 | Tom Clegg | |X|recent/X|trash/X|State| |
36 | |✓|-|-|X is not trash. It was written by an old version, or there was a race between EmptyTrash and Put. To fix, create a new recent/X.| |
||
37 | |✓|new|-|X is not trash, and is new. If it appears in a trash list, do nothing.| |
||
38 | |✓|old|-|X is not trash, and is old. If it appears in a trash list, trash it.| |
||
39 | |✓|any|✓|X is not trash. There was a race between Trash and Put, or Trash didn't (yet) finish.| |
||
40 | 6 | Tom Clegg | |-|old|new*|X is trash, not yet eligible for deletion. *newer than (now - trashLifetime + raceWindow)| |
41 | |-|old|racy*|X is trash, not yet eligible for deletion, and prone to races with Trash. *older than (now - trashLifetime + raceWindow)| |
||
42 | |-|old|old*|X is trash, eligible for deletion. *older than (now - trashLifetime)| |
||
43 | 4 | Tom Clegg | |-|new|✓|X is not trash. There was a race between Trash and Put.| |
44 | |-|-|✓|Impossible? (Race between Trash, EmptyTrash, and Put?) To fix, touch recent/X and copy trash/X to X.| |
||
45 | |||
46 | h2. Implementation draft |
||
47 | 1 | Tom Clegg | |
48 | Put(X): |
||
49 | # Write data to data object, "X" |
||
50 | # Write zero bytes to mtime object, "recent/X" |
||
51 | |||
52 | Touch(X): |
||
53 | # Write zero bytes to mtime object, "recent/X" |
||
54 | # Ensure data object "X" exists. If not: |
||
55 | #* If FixRace(X) returns false, return ErrNotExist |
||
56 | #* Else return nil (OK) |
||
57 | #* (Handlers only use Touch after Get succeeds, so a 404 here is almost certainly an early opportunity to catch and correct a touch-vs.-trash race.) |
||
58 | |||
59 | Mtime(X): |
||
60 | # Ensure data object "X" exists. If not, return ErrNotExist. |
||
61 | # Ensure "recent/X" exists. If not, create it. |
||
62 | # Return LastModified("recent/X") |
||
63 | |||
64 | Trash(X): |
||
65 | # If trashLifetime==0 && !unsafeDelete: "not implemented" error |
||
66 | 7 | Tom Clegg | # If "trash/X" exists and LastModified("trash/X") + trashLifetime < now + raceWindow, then it is not safe to delete "X", so return without doing anything. |
67 | 1 | Tom Clegg | # Get "recent/X" |
68 | # PutCopy("X", "trash/X") |
||
69 | # Del("X") |
||
70 | |||
71 | Get(X): |
||
72 | # If Get("X") succeeds, return content |
||
73 | # Else if Get returned a 404, and FixRace(X) returns true, call Get("X") again and return content |
||
74 | |||
75 | FixRace(X): "check whether X needs to be cleaned up after a race; if so, clean it up. Return true if X was untrashed during cleanup." |
||
76 | # Use HEAD to get LastModified("recent/X"). |
||
77 | # If "recent/X" does not exist, an emptyTrash/Trash race went wrong, so we need to fix it by saving the very conservative value "now" as "most recent Put or Touch": |
||
78 | #* Put("recent/X", zeroBytes) |
||
79 | #* Log this event. |
||
80 | # Use HEAD to get LastModified("trash/X"). If "trash/X" does not exist, return false. |
||
81 | # If LastModified("trash/X") < LastModified("recent/X") + permissionTTL, a touch/trash race went wrong, so we need to fix it: |
||
82 | #* Perform HEAD requests for "X" and "trash/X" |
||
83 | 7 | Tom Clegg | #* PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time) |
84 | 1 | Tom Clegg | #* Log this event. |
85 | #* Return true. |
||
86 | # Return false. |
||
87 | |||
88 | Untrash(X): |
||
89 | # PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time) |
||
90 | |||
91 | EmptyTrash(): |
||
92 | # List all trash objects and mtime objects ("trash/X" and "recent/X"), using a merge sort to get them in matching pairs. For each X... |
||
93 | # If LastModified("trash/X") < LastModified("recent/X") + permissionTTL: call FixRace(X) |
||
94 | # Else if now() > LastModified("trash/X") + trashLifetime, it's time to really delete this object: |
||
95 | #* Del("trash/X") |
||
96 | 2 | Tom Clegg | #* Del("recent/X") unless "X" exists |
97 | 1 | Tom Clegg | |
98 | Index(): |
||
99 | # List all data objects and mtime objects ("X" and "recent/X"), using a merge sort to get them in matching pairs. |
||
100 | # Include X in the index only if "X" *and* "recent/X" exist. |
||
101 | # In the index response row "hash+size timestamp": hash is X, size is Size("X"), timestamp is LastModified("recent/X") |
||
102 | |||
103 | 2 | Tom Clegg | Sometimes, races will cause valuable blocks to be trashed, but they will be recovered automatically by FixRace(), provided: |
104 | * the S3 service achieves eventual consistency faster than raceWindow, and |
||
105 | * raceWindow < trashLifetime |
||
106 | 1 | Tom Clegg | |
107 | These "accidentally trashed" blocks will be omitted from the index, which means keep-balance will occasionally report them as "lost". This is undesirable, because never cry wolf. |
||
108 | |||
109 | h2. Unresolved issues |
||
110 | |||
111 | 3 | Tom Clegg | Is there a deadline for eventual consistency? IOW, is there a value raceWindow can be set to? |