S3 bucket volume implementation » History » Version 4
Tom Clegg, 07/06/2016 01:24 PM
1 | 1 | Tom Clegg | h1. S3 bucket volume implementation |
---|---|---|---|
2 | |||
3 | {{toc}} |
||
4 | |||
5 | h2. Background - unsafe delete |
||
6 | |||
7 | With the current (2016-07-04) S3 volume implementation, an S3 volume can store blocks safely, but cannot do garbage collection safely. |
||
8 | |||
9 | There is a command line flag @-s3-unsafe-delete@ that enables unsafe garbage collection. With this setting, it is possible to lose data: if a client writes a new copy of a previously stored block that is now scheduled for deletion, the newly written copy can be deleted even though the client is told the data is safely stored. |
||
10 | |||
11 | h2. S3 API challenge |
||
12 | |||
13 | S3 offers eventual consistency. Specifically: |
||
14 | * Reading an object doesn't necessarily return the latest version: "write X=foo; write X=bar; read X" might return foo. |
||
15 | * Deleting is not immediate: "write X=foo; delete X; read X" might return foo. |
||
16 | * There is no atomic conditional write operation (e.g., "overwrite X only if X has etag=foo"). |
||
17 | |||
18 | This makes it hard to do garbage collection. How can we delete X if we don't know whether X has been written recently? |
||
19 | |||
20 | 4 | Tom Clegg | h2. Proposed approach |
21 | 1 | Tom Clegg | |
22 | 4 | Tom Clegg | The state of a block is represented by the existence and timestamps of three objects: |
23 | 1 | Tom Clegg | * Data object, name="{md5}" -- contains the stored data; timestamp and metadata are irrelevant. |
24 | * Mtime object, name="recent/{md5}" -- zero data bytes; Last-Modified reflects most recent Put or Touch. |
||
25 | * Trash object, name="trash/{md5}" -- contains a copy of the stored data; Last-Modified reflects most recent Trash. |
||
26 | 4 | Tom Clegg | |
27 | The strategy is: |
||
28 | # Tolerate, and automatically clean up, untidy state caused by races. |
||
29 | # Avoid entering races where there's a possibility of losing data. |
||
30 | |||
31 | |X|recent/X|trash/X|State| |
||
32 | |✓|-|-|X is not trash. It was written by an old version, or there was a race between EmptyTrash and Put. To fix, create a new recent/X.| |
||
33 | |✓|new|-|X is not trash, and is new. If it appears in a trash list, do nothing.| |
||
34 | |✓|old|-|X is not trash, and is old. If it appears in a trash list, trash it.| |
||
35 | |✓|any|✓|X is not trash. There was a race between Trash and Put, or Trash didn't (yet) finish.| |
||
36 | |-|old|✓|X is trash.| |
||
37 | |-|new|✓|X is not trash. There was a race between Trash and Put.| |
||
38 | |-|-|✓|Impossible? (Race between Trash, EmptyTrash, and Put?) To fix, touch recent/X and copy trash/X to X.| |
||
39 | |||
40 | h2. Implementation draft |
||
41 | 1 | Tom Clegg | |
42 | Put(X): |
||
43 | # Write data to data object, "X" |
||
44 | # Write zero bytes to mtime object, "recent/X" |
||
45 | |||
46 | Touch(X): |
||
47 | # Write zero bytes to mtime object, "recent/X" |
||
48 | # Ensure data object "X" exists. If not: |
||
49 | #* If FixRace(X) returns false, return ErrNotExist |
||
50 | #* Else return nil (OK) |
||
51 | #* (Handlers only use Touch after Get succeeds, so a 404 here is almost certainly an early opportunity to catch and correct a touch-vs.-trash race.) |
||
52 | |||
53 | Mtime(X): |
||
54 | # Ensure data object "X" exists. If not, return ErrNotExist. |
||
55 | # Ensure "recent/X" exists. If not, create it. |
||
56 | # Return LastModified("recent/X") |
||
57 | |||
58 | Trash(X): |
||
59 | # If trashLifetime==0 && !unsafeDelete: "not implemented" error |
||
60 | 2 | Tom Clegg | # If "trash/X" exists and LastModified("trash/X") + trashLifetime > now + raceWindow, then it is not safe to delete "X", so return without doing anything. |
61 | 1 | Tom Clegg | # Get "recent/X" |
62 | # PutCopy("X", "trash/X") |
||
63 | # Del("X") |
||
64 | |||
65 | Get(X): |
||
66 | # If Get("X") succeeds, return content |
||
67 | # Else if Get returned a 404, and FixRace(X) returns true, call Get("X") again and return content |
||
68 | |||
69 | FixRace(X): "check whether X needs to be cleaned up after a race; if so, clean it up. Return true if X was untrashed during cleanup." |
||
70 | # Use HEAD to get LastModified("recent/X"). |
||
71 | # If "recent/X" does not exist, an emptyTrash/Trash race went wrong, so we need to fix it by saving the very conservative value "now" as "most recent Put or Touch": |
||
72 | #* Put("recent/X", zeroBytes) |
||
73 | #* Log this event. |
||
74 | # Use HEAD to get LastModified("trash/X"). If "trash/X" does not exist, return false. |
||
75 | # If LastModified("trash/X") < LastModified("recent/X") + permissionTTL, a touch/trash race went wrong, so we need to fix it: |
||
76 | #* Perform HEAD requests for "X" and "trash/X" |
||
77 | #* If "X" exists but etag differs from "trash/X", log a collision warning and do nothing |
||
78 | #* If "X" exists and etag matches, delete "trash/X" |
||
79 | #* If "X" does not exist, PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time) |
||
80 | #* Log this event. |
||
81 | #* Return true. |
||
82 | # Return false. |
||
83 | |||
84 | Untrash(X): |
||
85 | # PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time) |
||
86 | |||
87 | EmptyTrash(): |
||
88 | # List all trash objects and mtime objects ("trash/X" and "recent/X"), using a merge sort to get them in matching pairs. For each X... |
||
89 | # If LastModified("trash/X") < LastModified("recent/X") + permissionTTL: call FixRace(X) |
||
90 | # Else if now() > LastModified("trash/X") + trashLifetime, it's time to really delete this object: |
||
91 | #* Del("trash/X") |
||
92 | 2 | Tom Clegg | #* Del("recent/X") unless "X" exists |
93 | 1 | Tom Clegg | |
94 | Index(): |
||
95 | # List all data objects and mtime objects ("X" and "recent/X"), using a merge sort to get them in matching pairs. |
||
96 | # Include X in the index only if "X" *and* "recent/X" exist. |
||
97 | # In the index response row "hash+size timestamp": hash is X, size is Size("X"), timestamp is LastModified("recent/X") |
||
98 | |||
99 | 2 | Tom Clegg | Sometimes, races will cause valuable blocks to be trashed, but they will be recovered automatically by FixRace(), provided: |
100 | * the S3 service achieves eventual consistency faster than raceWindow, and |
||
101 | * raceWindow < trashLifetime |
||
102 | 1 | Tom Clegg | |
103 | These "accidentally trashed" blocks will be omitted from the index, which means keep-balance will occasionally report them as "lost". This is undesirable, because never cry wolf. |
||
104 | |||
105 | h2. Unresolved issues |
||
106 | |||
107 | 3 | Tom Clegg | Is there a deadline for eventual consistency? IOW, is there a value raceWindow can be set to? |