Feature #8555

[Keep] Implement trash/untrash behavior in s3_volume

Added by Peter Amstutz over 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
Start date:
06/16/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Subtasks

Task #9424: Enhance S3 library to support if-unmodified-since header on Del -- https://github.com/AdRoll/goamz/issues/429Closed

Task #9384: Review 8555-s3-trash @ commit:5f00956ResolvedRadhika Chippada


Related issues

Related to Arvados - Story #8178: [Keepstore] Implement interfaces trash area and undelete endpointResolved01/11/2016

Related to Arvados - Story #7988: [Keep] Single keepstore responsible for trash lists on S3Closed

Associated revisions

Revision 7ba6bdf4
Added by Tom Clegg about 5 years ago

Merge branch '8555-s3-trash'

closes #8555

History

#1 Updated by Peter Amstutz over 5 years ago

  • Subject changed from [Keep] Implement trash/untrash behavior in volume_s3 to [Keep] Implement trash/untrash behavior in s3_volume

#2 Updated by Brett Smith over 5 years ago

  • Target version set to Arvados Future Sprints

#3 Updated by Radhika Chippada over 5 years ago

  • Assigned To set to Radhika Chippada
  • Target version changed from Arvados Future Sprints to 2016-06-22 sprint

#4 Updated by Radhika Chippada over 5 years ago

  • Story points set to 2.0

#5 Updated by Tom Clegg over 5 years ago

The S3 API doesn't have "update metadata" -- in order to set a "trash" flag in metadata, we'd need to download/re-upload or use the "copy" operation.

I see two approaches that could work:
  1. Add a "trash marker" object. We explored this in Azure Storage but ended up going with metadata instead. Drawbacks: every operation requires multiple S3 API calls ("check trash"), IndexTo has to merge the trash-marker list and the real-object list, and there are lots of races to think about.
  2. Use "copy-and-delete" to emulate "rename", and use the same approach as UnixVolume, i.e., "{hash}.trash.{exp}" is a trashed block that should be deleted at time {exp}. We could consider a different naming scheme like "trash.{hash}.{exp}" so EmptyTrash doesn't have to traverse the entire index.

#6 Updated by Peter Amstutz over 5 years ago

+1 on #2.

AWS provides "PUT Object - Copy" mode for efficient server-side copying. Also agree on having a naming scheme that makes it easy to iterate over just the trash.

#7 Updated by Tom Clegg over 5 years ago

We already use PutCopy() to update last-modified time in Touch() -- see source:services/keepstore/s3_volume.go

Watch out for this:

"This means that a 200 OK response can contain either a success or an error. Make sure to design your application to parse the contents of the response and handle it appropriately." -- http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html

It looks like the goamz library does not check this for us, and the easiest way to verify we're getting "200 OK = success" rather than "200 OK = error" is to check result.LastModified the way we do in Touch().

Also watch out for a race during copy-and-rename:
  1. goroutine-1 calls PutCopy() with intent to trash
  2. goroutine-2 calls PutCopy() with intent to touch
  3. goroutine-2 receives successful response from PutCopy()
  4. goroutine-2 returns success ("block is safe")
  5. goroutine-1 receives successful response from PutCopy(), and goes on to trash the block
Using the "x-amz-copy-source-if-unmodified-since" header in Trash() should be enough to save us here.
  • We need to add that feature to the goamz library (s3/s3.go → CopyOptions and (CopyOptions)addHeaders())
  • We already need to add "if-unmodified-since" support to (*Bucket)Del() so we can run without "-s3-unsafe-delete".

Of course, Touch() is also a reminder that we could use PutCopy() to update/replace metadata without actually reading/rewriting the data. This would eliminate the possibility of iterating over the trash without iterating over the entire index. (This isn't necessarily a problem -- normally "empty trash" runs less frequently than datamanager/keep-balance, which get the entire index.)

#8 Updated by Tom Clegg over 5 years ago

Unlike Azure, S3 does not seem to offer an API that retrieves metadata for more than one object at a time. Therefore, in order to index a bucket with N items without doing N API calls, we must mark trash using something other than just metadata.

This leaves us with "strategy number 2": Use "copy-and-delete" to emulate "rename", and do something similar to UnixVolume.

Trash:
  1. Get mtime of {hash}
  2. If mtime > {now minus permissionTTL} then skip
  3. Use PutCopy to copy {hash} to trash.{hash}.{exp}
  4. Delete {hash} If-Unmodified-Since {mtime we just looked up} (note: if precondition fails due to a race, we can end up with two copies, one in trash and one not; that's OK.)
EmptyTrash:
  1. List objects whose names match trash.* ("trashed block")
  2. If {exp} > now, skip
  3. Delete trashed block
Untrash:
  1. List objects whose names match trash.{hash}.*; pick the first one.
  2. Use PutCopy to copy trash.{hash}.{exp} to {hash}

#9 Updated by Brett Smith over 5 years ago

As much as reasonable, I would like it if we avoided leaving around extra copies of blobs. Those are likely to cost users, either from cloud billing or by taking up additional space on S3-speaking local storage.

If I followed the implementation sketches correctly, this just means adding a couple of steps:

Tom Clegg wrote:

Trash:
  1. Get mtime of {hash}
  2. If mtime > {now minus permissionTTL} then skip
  3. Use PutCopy to copy {hash} to trash.{hash}.{exp}
  4. Delete {hash} If-Unmodified-Since {mtime we just looked up} (note: if precondition fails due to a race, we can end up with two copies, one in trash and one not; that's OK.)

If the delete failed, then we should try to delete trash.{hash}.{exp}. If that fails too, it's not a fatal problem, but we should at least try to clean up after the known race.

Untrash:
  1. List objects whose names match trash.{hash}.*; pick the first one.
  2. Use PutCopy to copy trash.{hash}.{exp} to {hash}

After this, delete trash.{hash}.{exp}.

#10 Updated by Radhika Chippada over 5 years ago

Tom said:

Trash step 4: Delete {hash} If-Unmodified-Since {mtime we just looked up}

This requires updating the S3 library to support if-unmodified-since header. I created an issue for the S3 library. https://github.com/AdRoll/goamz/issues/429

#11 Updated by Radhika Chippada over 5 years ago

After much digging and discussing with Tom, here are a few thoughts:

  • S3 does not seem to provide any support for If-Unmodified-Since on Delete requests. This hence does not seem doable.
  • It appears that we might be able to make do with "versioning". Assuming we enable versioning in our S3 storage (and configure to delete old versions swiftly to minimize storage costs):
    • Mtime returns versionId
    • PutCopy (with trash.hash.expire)
    • DelMulti with S3 Object consisting of the versionId from Mtime

#12 Updated by Tom Clegg over 5 years ago

Even in PutCopy, which accepts conditional headers ("If-Unmodified-Since" etc.), the conditions refer to the source, not the destination. There just aren't any atomic read-and-write operations. "Amazon S3 is a distributed system. If it receives multiple write requests for the same object simultaneously, it overwrites all but the last object written. Amazon S3 does not provide object locking; if you need this, make sure to build it into your application layer or use versioning instead."

#13 Updated by Radhika Chippada over 5 years ago

There is really no straightforward way to do this since the Delete operation does not support a If-Unmodified-Since like behavior. After much deliberation, Tom and I came up with a plan that could work:

  • When a Locator object is being Put, also store a separate marker (<hash>-last-modified-at ?) with the Modified-At timestamp
  • Update it during Touch operation (along with the object as we do now)
  • Do not touch it during Trash operation (Make a PutCopy of <hash>-trash-<expiry> and delete the Loc object)
    • If a race occurred due to Touch between PutCopy and Delete obj operations, we do not have the object anymore, but still have the marker and the trash'ed copy
  • During EmptyTrash (and IndexTo?), when working with <hash>-trash-*, check if the last-modified-at marker is newer than TrashLifetime. If so, restore the Object and delete the trash copy.

#14 Updated by Tom Clegg over 5 years ago

Proposed approach: Each block is represented by (up to) three S3 objects.
  • Data object, name="{md5}" -- contains the stored data; timestamp and metadata are irrelevant.
  • Mtime object, name="recent/{md5}" -- zero data bytes; Last-Modified reflects most recent Put or Touch.
  • Trash object, name="trash/{md5}" -- contains a copy of the stored data; Last-Modified reflects most recent Trash.
Put(X):
  1. Write data to data object, "X"
  2. Write zero bytes to mtime object, "recent/X"
Touch(X):
  1. Write zero bytes to mtime object, "recent/X"
  2. Ensure data object "X" exists. If not:
    • If FixRace(X) returns false, return ErrNotExist
    • Else return nil (OK)
    • (Handlers only use Touch after Get succeeds, so a 404 here is almost certainly an early opportunity to catch and correct a touch-vs.-trash race.)
Mtime(X):
  1. Ensure data object "X" exists. If not, return ErrNotExist.
  2. Ensure "recent/X" exists. If not, create it.
  3. Return LastModified("recent/X")
Trash(X):
  1. If trashLifetime==0 && !unsafeDelete: "not implemented" error
  2. Get "recent/X"
  3. PutCopy("X", "trash/X")
  4. Del("X")
Get(X):
  1. If Get("X") succeeds, return content
  2. Else if Get returned a 404, and FixRace(X) returns true, call Get("X") again and return content
FixRace(X): "check whether X needs to be cleaned up after a race; if so, clean it up. Return true if X was untrashed during cleanup."
  1. Use HEAD to get LastModified("recent/X").
  2. If "recent/X" does not exist, an emptyTrash/Trash race went wrong, so we need to fix it by saving the very conservative value "now" as "most recent Put or Touch":
    • Put("recent/X", zeroBytes)
    • Log this event.
  3. Use HEAD to get LastModified("trash/X"). If "trash/X" does not exist, return false.
  4. If LastModified("trash/X") < LastModified("recent/X") + permissionTTL, a touch/trash race went wrong, so we need to fix it:
    • Perform HEAD requests for "X" and "trash/X"
    • If "X" exists but etag differs from "trash/X", log a collision warning and do nothing
    • If "X" exists and etag matches, delete "trash/X"
    • If "X" does not exist, PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time)
    • Log this event.
    • Return true.
  5. Return false.
Untrash(X):
  1. PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time)
EmptyTrash():
  1. List all trash objects and mtime objects ("trash/X" and "recent/X"), using a merge sort to get them in matching pairs. For each X...
  2. If LastModified("trash/X") < LastModified("recent/X") + permissionTTL: call FixRace(X)
  3. Else if now() > LastModified("trash/X") + trashLifetime, it's time to really delete this object:
    • Del("trash/X")
    • Del("recent/X")
Index():
  1. List all data objects and mtime objects ("X" and "recent/X"), using a merge sort to get them in matching pairs.
  2. Include X in the index only if "X" and "recent/X" exist.
  3. In the index response row "hash+size timestamp": hash is X, size is Size("X"), timestamp is LastModified("recent/X")

I think this approach avoids losing data in races. Sometimes, races will cause valuable blocks to be trashed, but they will be recovered automatically by FixRace(), provided the S3 service achieves eventual consistency faster than trashLifetime.

These "accidentally trashed" blocks will be omitted from the index, which means keep-balance will occasionally report them as "lost". This is slightly off-putting, because never cry wolf.

#15 Updated by Brett Smith over 5 years ago

Various operations talk about the return value of FixRace, but that return value isn't defined AFAICT.

This is less of a design consideration, and more something to bear in mind for implementation: all S3 API calls cost money, at least in the cloud. I think this proposal is as efficient as it can be, but I would appreciate if the code took care to minimize underlying API calls. For example, in part 2 of FixRace, I believe the code can call HEAD trash/X, use that result for the Last-Modified comparison, and continue using the result if we fall into the conditional, without making a second HEAD request.

#16 Updated by Tom Clegg over 5 years ago

Edited note-14 to address these points:
  • Use "list + merge sort" strategy in EmptyTrash to avoid calling FixRace() (and therefore using HEAD) on every single trashed block.
  • Specify FixRace() return value (true means data object "X" has reappeared).
  • Be explicit about when to call HEAD in FixRace, with a view to conserving S3 API calls.

#17 Updated by Tom Clegg over 5 years ago

Possible flaw: a race between FixRace() and Trash() could cause both "X" and "trash/X" to be deleted.

This would require:
  • Both "X" and "trash/X" exist: e.g., "write X; trash X; write X"
  • recent/X is older than permissionTTL (otherwise Trash won't call Del(X))
  • recent/X is newer than trash/X - permissionTTL (otherwise FixRace won't call Del(trash/X))

So far, I have neither come up with a scenario ending up in this corner, nor convinced myself that no such scenario exists.

#18 Updated by Peter Amstutz over 5 years ago

(I haven't read the whole thread, something just popped into my head)

Can we serialize access by only allowing a single keep server to send delete requests to the S3 bucket?

#19 Updated by Brett Smith over 5 years ago

  • Target version changed from 2016-06-22 sprint to 2016-07-06 sprint

#20 Updated by Radhika Chippada over 5 years ago

Peter said:

Can we serialize access by only allowing a single keep server to send delete requests to the S3 bucket?

The problem is not delete per se, but a touch / update operation happening on an object while the two-step trash operation is in progress (PutCopy trash-X and Del X). Even if we serialize as you suggested, we cannot prevent another keep server from touch'ing the object while trash is in progress on a different server.

That said, I am wondering if we can wrap S3 object updates (Put, Touch, Del) in a "transaction" using DB (or some other means)? How about we simulate a JTA like behavior using: begin db transaction; insert an object uuid into a locking table; Do the S3 operation(s); rollback transaction? We can alternatively use S3 storage itself to insert a lock object (lock-<hash>) and removing it after the actual object operation, but this will require us to manage server crashes ourselves (may be we can expand keepstore signal handling ...).

#21 Updated by Radhika Chippada over 5 years ago

Possible flaw: a race between FixRace() and Trash() could cause both "X" and "trash/X" to be deleted. ... So far, I have neither come up with a scenario ending up in this corner, nor convinced myself that no such scenario exists.

Both "X" and "trash/X" exist: e.g., "write X; trash X; write X"
recent/X is older than permissionTTL (otherwise Trash won't call Del(X))
recent/X is newer than trash/X - permissionTTL (otherwise FixRace won't call Del(trash/X))

It can be a potential race condition. Let's say, a Trash and Touch operations occurred at 9am on two different threads such that trash/X = 9:00 and recent/X (by touch) completes at 9:01 and hence recent/X is newer than trash/X. Let's say TTL = 1hr and at 11.30 the following is happening:

Timeline Keep Balance Empty Trash calls FixRace
0 Call Trash Call FixRace
1 Get "recent/X"
2 Get HEAD for X and trash/X
3 PutCopy("X", "trash/X")
4 X exists, so Delete trash/X
5 Del("X")

#22 Updated by Radhika Chippada over 5 years ago

Would it help if we do not delete the trash/X in FixRace?

#23 Updated by Radhika Chippada over 5 years ago

  • Target version changed from 2016-07-06 sprint to Arvados Future Sprints

#24 Updated by Tom Clegg over 5 years ago

Copied note-14 to S3 bucket volume implementation with a fix for the "trash vs. empty-trash" race.

#25 Updated by Tom Clegg over 5 years ago

  • Target version changed from Arvados Future Sprints to 2016-07-20 sprint

#26 Updated by Tom Clegg over 5 years ago

  • Category set to Keep
  • Assigned To changed from Radhika Chippada to Tom Clegg

#27 Updated by Tom Clegg over 5 years ago

  • Status changed from New to In Progress

#28 Updated by Tom Clegg over 5 years ago

8555-s3-trash @ 2c64a08

Existing tests pass (with one change: needed to permit Trash to fail when the same block is already in trash).

TODO: I'd like to add a set of s3-specific tests as well, to make sure the various states listed on the wiki are handled correctly. So this is "mostly ready" for review -- presumably the code will change if the state-table tests reveal bugs.

#29 Updated by Radhika Chippada over 5 years ago

On first pass, things look all right and I will make one more pass after you add those tests. A few style comments in the meantime:

  • ErrS3TrashNotAvailable => ErrS3TrashNotPermitted ?
  • ”trash is not possible because …” => "trash is not permitted because …” =>
  • getReaderWithFixRace => I think calling it just getReader and adding a comment for the method (if loc not found and recent/loc found …) would be helpful
  • In Trash, the block ( // Make sure we're not in the race window ) would be easier to understand if this is a separate if/else statement ( } else if t, err := v.lastModified(resp); err != nil { )
  • go fmt and lint issues

#30 Updated by Tom Clegg about 5 years ago

  • Target version changed from 2016-07-20 sprint to 2016-08-03 sprint
  • Story points changed from 2.0 to 0.5

#31 Updated by Radhika Chippada about 5 years ago

s3_volume - IndexTo

  • Really awesome

s3_volume - EmptyTrash

  • Question about the following block: since we are checking it and assuming this is a possible scenario: When err == os.IsNotExist (rather than a transient error), do we want to fixRace? Otherwise, we could basically end up with an orphaned trash loc?
        recent, err := v.Bucket.Head("recent/"+loc, nil)
        if err != nil {
            log.Printf("warning: %s: EmptyTrash: cannot delete trash %q with no corresponding recent/* marker", v, trash.Key)
            continue
        }
  • We are doing one Head request one at a time for each recent/loc. Is there an opportunity to improve performance here by getting the recentL s3Lister? Since we are doing a fixRace when "trashT.Sub(recentT) < blobSignatureTTL", we can use the value from recentL when it meets this condition; all other cases (not found in recentL or greater than TTL), we get it explicitly using Head and use that to determine the rest of the logic as we have it now.
  • what is the reason to not print the stats like we did in unix and azure storage cases?
  • can you call "now" as "startAt"?

TestBackendStates

  • It would be easier to follow if we called time attrs in TestBackendStates struct as dataT, TrashT, recentT etc
  • It would greatly help in readability if you could please update the scenarios as follows:
--- a/services/keepstore/s3_volume_test.go
+++ b/services/keepstore/s3_volume_test.go
@@ -147,7 +147,8 @@ func (s *StubbedS3Suite) TestBackendStates(c *check.C) {
                {
                        "No related objects",
                        none, none, none,
-                       false, false, false, false, false, false},
+                       false, false, false, false, false, false,
+               },
                { ... 
  • Can we call stubKey as putStubKey?
  • (after much hair pulling) I understood that the 4 setup calls in TestBackendStates are doing 4 entirely unrelated tests in this one Test. Can you please add comments for each setup call and make sure it is clear to the reader of this test that the 4 loc's are different? Please also consider renaming setup as setupScenario.
  • Please consider adding small comment to explain what each scenario is accomplishing, at least the first few. It took me a couple mins to understand, ""Not trash; old enough to trash" meant "it is old enough to trash but we didn't get around to trash it yet"
  • Wondering if we should add a canGetAfterUntrash check as well?
            canTrash            bool
            canGetAfterTrash    bool
            canUntrash          bool
    +       canGetAfterUntrash  bool
    

#32 Updated by Tom Clegg about 5 years ago

Radhika Chippada wrote:

  • Question about the following block: since we are checking it and assuming this is a possible scenario: When err == os.IsNotExist (rather than a transient error), do we want to fixRace? Otherwise, we could basically end up with an orphaned trash loc?

Indeed, the IsNotExist case would result in un-collectable trash. Fixed by calling Untrash in this case. Added test.

  • We are doing one Head request one at a time for each recent/loc. Is there an opportunity to improve performance here by getting the recentL s3Lister? Since we are doing a fixRace when "trashT.Sub(recentT) < blobSignatureTTL", we can use the value from recentL when it meets this condition; all other cases (not found in recentL or greater than TTL), we get it explicitly using Head and use that to determine the rest of the logic as we have it now.

I considered this, but it would mean indexing all timestamp objects, instead of just the markers for trashed items. Which one results in fewer S3 API calls depends on what proportion of the data is trash.

AFAICT there is no API for counting objects, so we can't determine the exact proportion before deciding on a strategy. However, if we rely on the even distribution of md5 hashes, we could get a good estimate by fetching one page of trash/*: If the 100th trash object is trash/0a345... then we can expect 100 * 0x1000 / 0x0a3 = 2512 trash objects.

Using this information we can set a threshold to determine strategy: e.g., if we expect more than 200 non-trash items per trash item, then we should do individual HEAD requests (resulting in <5× as many API calls, and >100× less response data to transfer) -- otherwise we should do a merge sort like IndexTo.

  • what is the reason to not print the stats like we did in unix and azure storage cases?

Forgot. Added in 99aa076.

  • can you call "now" as "startAt"?

Changed to startT.

  • It would be easier to follow if we called time attrs in TestBackendStates struct as dataT, TrashT, recentT etc

Renamed.

  • It would greatly help in readability if you could please update the scenarios as follows:

Reformatted.

  • Can we call stubKey as putStubKey?

Renamed to putS3Obj().

  • (after much hair pulling) I understood that the 4 setup calls in TestBackendStates are doing 4 entirely unrelated tests in this one Test. Can you please add comments for each setup call and make sure it is clear to the reader of this test that the 4 loc's are different? Please also consider renaming setup as setupScenario.

Renamed to setupScenario, and updated it to return (loc, blk) explicitly, so it's much easier to see that loc, blk change with every setup.

  • Please consider adding small comment to explain what each scenario is accomplishing, at least the first few. It took me a couple mins to understand, ""Not trash; old enough to trash" meant "it is old enough to trash but we didn't get around to trash it yet"

Updated the labels to explain better (I hope).

  • Wondering if we should add a canGetAfterUntrash check as well?

Added.

Also added a "Put causes fresh Mtime" test for all scenarios.

Now at 5f47ebc

#33 Updated by Radhika Chippada about 5 years ago

  • Do we want to capture any errors during Untrash in the following (newly added block) and log?
        if err != nil && os.IsNotExist(v.translateError(err)) {
            log.Printf("warning: %s: EmptyTrash: found trash marker %q but no %q (%s); calling Untrash", v, trash.Key, "recent/"+loc, err)
            v.Untrash(loc)
            continue
        } else if err != nil {
            log.Printf("warning: %s: EmptyTrash: HEAD %q: %s", v, "recent/"+loc, err)
  • In TestBackendStates, can you please add a one liner comment for each "loc, blk := setupScenario?" Something like: "can Trash loc in this scenario." I think it would help see what is coming since the test is quite long and requires scrolling back and forth to see what is going on
  • go fmt and golint issues

#34 Updated by Tom Clegg about 5 years ago

Radhika Chippada wrote:

  • Do we want to capture any errors during Untrash in the following (newly added block) and log?

Yes, good point. Added a log message when Untrash fails.

  • In TestBackendStates, can you please add a one liner comment for each "loc, blk := setupScenario?" Something like: "can Trash loc in this scenario." I think it would help see what is coming since the test is quite long and requires scrolling back and forth to see what is going on

Added comments

  • go fmt and golint issues

Fixed, and updated run-tests.sh to fail on gofmt diffs.

#35 Updated by Radhika Chippada about 5 years ago

(go fmt) Fixed, and updated run-tests.sh to fail on gofmt diffs.

This is great. Thanks for adding it. (The first time you see the error, I am not sure if it is quite clear why the tests failed. Please consider announcing this in irc.)

LGTM

#36 Updated by Tom Clegg about 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:7ba6bdf406546ec225baea49dbe6ccbf02e70f53.

Also available in: Atom PDF