S3 bucket volume implementation » History » Version 2
Tom Clegg, 07/04/2016 08:30 PM
1 | 1 | Tom Clegg | h1. S3 bucket volume implementation |
---|---|---|---|
2 | |||
3 | {{toc}} |
||
4 | |||
5 | h2. Background - unsafe delete |
||
6 | |||
7 | With the current (2016-07-04) S3 volume implementation, an S3 volume can store blocks safely, but cannot do garbage collection safely. |
||
8 | |||
9 | There is a command line flag @-s3-unsafe-delete@ that enables unsafe garbage collection. With this setting, it is possible to lose data: if a client writes a new copy of a previously stored block that is now scheduled for deletion, the newly written copy can be deleted even though the client is told the data is safely stored. |
||
10 | |||
11 | h2. S3 API challenge |
||
12 | |||
13 | S3 offers eventual consistency. Specifically: |
||
14 | * Reading an object doesn't necessarily return the latest version: "write X=foo; write X=bar; read X" might return foo. |
||
15 | * Deleting is not immediate: "write X=foo; delete X; read X" might return foo. |
||
16 | * There is no atomic conditional write operation (e.g., "overwrite X only if X has etag=foo"). |
||
17 | |||
18 | This makes it hard to do garbage collection. How can we delete X if we don't know whether X has been written recently? |
||
19 | |||
20 | h2. Implementation draft |
||
21 | |||
22 | Proposed approach: Each block is represented by (up to) three S3 objects. |
||
23 | * Data object, name="{md5}" -- contains the stored data; timestamp and metadata are irrelevant. |
||
24 | * Mtime object, name="recent/{md5}" -- zero data bytes; Last-Modified reflects most recent Put or Touch. |
||
25 | * Trash object, name="trash/{md5}" -- contains a copy of the stored data; Last-Modified reflects most recent Trash. |
||
26 | |||
27 | Put(X): |
||
28 | # Write data to data object, "X" |
||
29 | # Write zero bytes to mtime object, "recent/X" |
||
30 | |||
31 | Touch(X): |
||
32 | # Write zero bytes to mtime object, "recent/X" |
||
33 | # Ensure data object "X" exists. If not: |
||
34 | #* If FixRace(X) returns false, return ErrNotExist |
||
35 | #* Else return nil (OK) |
||
36 | #* (Handlers only use Touch after Get succeeds, so a 404 here is almost certainly an early opportunity to catch and correct a touch-vs.-trash race.) |
||
37 | |||
38 | Mtime(X): |
||
39 | # Ensure data object "X" exists. If not, return ErrNotExist. |
||
40 | # Ensure "recent/X" exists. If not, create it. |
||
41 | # Return LastModified("recent/X") |
||
42 | |||
43 | Trash(X): |
||
44 | # If trashLifetime==0 && !unsafeDelete: "not implemented" error |
||
45 | 2 | Tom Clegg | # If "trash/X" exists and LastModified("trash/X") + trashLifetime > now + raceWindow, then it is not safe to delete "X", so return without doing anything. |
46 | 1 | Tom Clegg | # Get "recent/X" |
47 | # PutCopy("X", "trash/X") |
||
48 | # Del("X") |
||
49 | |||
50 | Get(X): |
||
51 | # If Get("X") succeeds, return content |
||
52 | # Else if Get returned a 404, and FixRace(X) returns true, call Get("X") again and return content |
||
53 | |||
54 | FixRace(X): "check whether X needs to be cleaned up after a race; if so, clean it up. Return true if X was untrashed during cleanup." |
||
55 | # Use HEAD to get LastModified("recent/X"). |
||
56 | # If "recent/X" does not exist, an emptyTrash/Trash race went wrong, so we need to fix it by saving the very conservative value "now" as "most recent Put or Touch": |
||
57 | #* Put("recent/X", zeroBytes) |
||
58 | #* Log this event. |
||
59 | # Use HEAD to get LastModified("trash/X"). If "trash/X" does not exist, return false. |
||
60 | # If LastModified("trash/X") < LastModified("recent/X") + permissionTTL, a touch/trash race went wrong, so we need to fix it: |
||
61 | #* Perform HEAD requests for "X" and "trash/X" |
||
62 | #* If "X" exists but etag differs from "trash/X", log a collision warning and do nothing |
||
63 | #* If "X" exists and etag matches, delete "trash/X" |
||
64 | #* If "X" does not exist, PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time) |
||
65 | #* Log this event. |
||
66 | #* Return true. |
||
67 | # Return false. |
||
68 | |||
69 | Untrash(X): |
||
70 | # PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time) |
||
71 | |||
72 | EmptyTrash(): |
||
73 | # List all trash objects and mtime objects ("trash/X" and "recent/X"), using a merge sort to get them in matching pairs. For each X... |
||
74 | # If LastModified("trash/X") < LastModified("recent/X") + permissionTTL: call FixRace(X) |
||
75 | # Else if now() > LastModified("trash/X") + trashLifetime, it's time to really delete this object: |
||
76 | #* Del("trash/X") |
||
77 | 2 | Tom Clegg | #* Del("recent/X") unless "X" exists |
78 | 1 | Tom Clegg | |
79 | Index(): |
||
80 | # List all data objects and mtime objects ("X" and "recent/X"), using a merge sort to get them in matching pairs. |
||
81 | # Include X in the index only if "X" *and* "recent/X" exist. |
||
82 | # In the index response row "hash+size timestamp": hash is X, size is Size("X"), timestamp is LastModified("recent/X") |
||
83 | |||
84 | 2 | Tom Clegg | Sometimes, races will cause valuable blocks to be trashed, but they will be recovered automatically by FixRace(), provided: |
85 | * the S3 service achieves eventual consistency faster than raceWindow, and |
||
86 | * raceWindow < trashLifetime |
||
87 | 1 | Tom Clegg | |
88 | These "accidentally trashed" blocks will be omitted from the index, which means keep-balance will occasionally report them as "lost". This is undesirable, because never cry wolf. |
||
89 | |||
90 | h2. Unresolved issues |
||
91 | |||
92 | If both X and trash/X exist (e.g., a client has written a copy of a block that was already in trash), a race between Trash and FixRace can delete both X and trash/X (see #8555#note-21). |