S3 bucket volume implementation

Background - unsafe delete

With the current (2016-07-04) S3 volume implementation, an S3 volume can store blocks safely, but cannot do garbage collection safely.

There is a command line flag -s3-unsafe-delete that enables unsafe garbage collection. With this setting, it is possible to lose data: if a client writes a new copy of a previously stored block that is now scheduled for deletion, the newly written copy can be deleted even though the client is told the data is safely stored.

S3 API challenge

S3 offers eventual consistency. Specifically:
  • Reading an object doesn't necessarily return the latest version: "write X=foo; write X=bar; read X" might return foo.
  • Deleting is not immediate: "write X=foo; delete X; read X" might return foo.
  • There is no atomic conditional write operation (e.g., "overwrite X only if X has etag=foo").

This makes it hard to do garbage collection. How can we delete X if we don't know whether X has been written recently?

Proposed approach

The state of a block is represented by the existence and timestamps of three objects:
  • Data object, name="{md5}" -- contains the stored data; timestamp and metadata are irrelevant.
  • Mtime object, name="recent/{md5}" -- zero data bytes; Last-Modified reflects most recent Put or Touch.
  • Trash object, name="trash/{md5}" -- contains a copy of the stored data; Last-Modified reflects most recent Trash.
The strategy is:
  1. Tolerate, and automatically clean up, untidy state caused by races.
  2. Avoid entering races where there's a possibility of losing data.
Specifically, there is a window, just before a trashed block becomes eligible for final deletion, when it is impossible to safely trash a newer copy of that block:
  • Racing EmptyTrash ("delete trash/X if it's old enough") against Trash ("copy X to trash/X, then delete X") can result in both copies (X and trash/X) being deleted: EmptyTrash inspects state and issues a "delete trash/X" request; then Trash issues "copy X to trash/X" and "delete X" requests; then the S3 service performs those two operations in the reverse order, and both copies are deleted.
  • Meanwhile, if Put(X) was also racing, we would discover next time we call FixRace() that X shouldn't have been trashed at all, and we should rescue it from the trash ... but that would be impossible.
X recent/X trash/X State
- - X is not trash. It was written by an old version, or there was a race between EmptyTrash and Put. To fix, create a new recent/X.
new - X is not trash, and is new. If it appears in a trash list, do nothing.
old - X is not trash, and is old. If it appears in a trash list, trash it.
any X is not trash. There was a race between Trash and Put, or Trash didn't (yet) finish.
- old new* X is trash, not yet eligible for deletion. *newer than (now - trashLifetime + raceWindow)
- old racy* X is trash, not yet eligible for deletion, and prone to races with Trash. *older than (now - trashLifetime + raceWindow)
- old old* X is trash, eligible for deletion. *older than (now - trashLifetime)
- new X is not trash. There was a race between Trash and Put.
- - Impossible? (Race between Trash, EmptyTrash, and Put?) To fix, touch recent/X and copy trash/X to X.

Implementation draft

Put(X):
  1. Write data to data object, "X"
  2. Write zero bytes to mtime object, "recent/X"
Touch(X):
  1. Write zero bytes to mtime object, "recent/X"
  2. Ensure data object "X" exists. If not:
    • If FixRace(X) returns false, return ErrNotExist
    • Else return nil (OK)
    • (Handlers only use Touch after Get succeeds, so a 404 here is almost certainly an early opportunity to catch and correct a touch-vs.-trash race.)
Mtime(X):
  1. Ensure data object "X" exists. If not, return ErrNotExist.
  2. Ensure "recent/X" exists. If not, create it.
  3. Return LastModified("recent/X")
Trash(X):
  1. If trashLifetime==0 && !unsafeDelete: "not implemented" error
  2. If "trash/X" exists and LastModified("trash/X") + trashLifetime < now + raceWindow, then it is not safe to delete "X", so return without doing anything.
  3. Get "recent/X"
  4. PutCopy("X", "trash/X")
  5. Del("X")
Get(X):
  1. If Get("X") succeeds, return content
  2. Else if Get returned a 404, and FixRace(X) returns true, call Get("X") again and return content
FixRace(X): "check whether X needs to be cleaned up after a race; if so, clean it up. Return true if X was untrashed during cleanup."
  1. Use HEAD to get LastModified("trash/X"). If "trash/X" does not exist, return false.
  2. Use HEAD to get LastModified("recent/X").
  3. If LastModified("trash/X") < LastModified("recent/X") + permissionTTL, a touch/trash race went wrong, so we need to fix it:
    • Perform HEAD requests for "X" and "trash/X"
    • PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time)
    • Log this event.
    • Return true.
  4. Return false.
Untrash(X):
  1. PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time)
EmptyTrash():
  1. List all trash objects. For each X...
  2. Look up corresponding mtime object ("recent/X")
  3. If LastModified("trash/X") < LastModified("recent/X") + permissionTTL: call FixRace(X)
  4. Else if now() > LastModified("trash/X") + trashLifetime, it's time to really delete this object:
    • Del("trash/X")
    • Del("recent/X") unless "X" exists
Index():
  1. List all data objects and mtime objects ("X" and "recent/X"), using a merge sort to get them in matching pairs.
  2. Include X in the index only if "X" and "recent/X" exist.
  3. In the index response row "hash+size timestamp": hash is X, size is Size("X"), timestamp is LastModified("recent/X")
Sometimes, races will cause valuable blocks to be trashed, but they will be recovered automatically by FixRace(), provided:
  • the S3 service achieves eventual consistency faster than raceWindow, and
  • raceWindow < trashLifetime

These "accidentally trashed" blocks will be omitted from the index, which means keep-balance will occasionally report them as "lost". This is undesirable, because never cry wolf.

Unresolved issues

Is there a deadline for eventual consistency? IOW, is there a value raceWindow can be set to?