Feature #8556 (Closed)
[Keep] Implement trash/untrash behavior in azure_blob_volume
Description
Implementation
- Trash → set <ExpiresAt>123</ExpiresAt> and create a marker (just an empty blob) named trash/000000000123/hash (see the sketch after this list)
- Untrash → unset ExpiresAt and delete trash/000000000123/hash
- EmptyTrash → index trash/* in order. For each trash marker with deadline <= now:
  - Look up the modtime and ExpiresAt for the data block itself.
  - If the data is still old and ExpiresAt is still set, delete the data block (otherwise do nothing).
  - Delete the trash marker (regardless of whether the data block was deleted).
- Get, Mtime, Touch, IndexTo → check ExpiresAt metadata. If present, return/skip as if the block was not found at all.
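A rough Go sketch of the Trash and Untrash steps above, against a hypothetical blobClient whose method names are modeled on (but not identical to) the Azure SDK of the time — trashLifetime, container handling, and all signatures are illustrative assumptions, not the final implementation:

```go
package sketch

import (
	"bytes"
	"fmt"
	"time"
)

// blobClient stands in for the Azure storage client; the method set and
// signatures here are assumptions for illustration, not the real SDK API.
type blobClient interface {
	SetBlobMetadata(container, name string, metadata map[string]string) error
	CreateBlockBlobFromReader(container, name string, size uint64, r *bytes.Reader) error
	DeleteBlob(container, name string) error
}

// trash hides a block by setting ExpiresAt metadata, then drops an empty
// marker blob under trash/ so EmptyTrash can find it with a prefix listing.
func trash(c blobClient, container, hash string, trashLifetime time.Duration) error {
	deadline := time.Now().Add(trashLifetime).Unix()
	err := c.SetBlobMetadata(container, hash, map[string]string{
		"ExpiresAt": fmt.Sprintf("%d", deadline),
	})
	if err != nil {
		return err
	}
	// Zero-padding keeps markers in deadline order when listed by prefix.
	marker := fmt.Sprintf("trash/%012d/%s", deadline, hash)
	return c.CreateBlockBlobFromReader(container, marker, 0, bytes.NewReader(nil))
}

// untrash clears ExpiresAt and deletes the marker for the given deadline.
func untrash(c blobClient, container, hash string, deadline int64) error {
	if err := c.SetBlobMetadata(container, hash, nil); err != nil {
		return err
	}
	return c.DeleteBlob(container, fmt.Sprintf("trash/%012d/%s", deadline, hash))
}
```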
Tests / Acceptance Criteria
Ensure {Put; Trash; Put} results in ExpiresAt being cleared. This might happen by itself, or we might need to use SetBlobMetadata to clear the metadata explicitly after writing each block. (The test for this should already be in place, in the generic volume test: {Put; Backdate; Trash; Put; Backdate; empty trash}.)
Ensure the Azure server stub behaves the same as the real Azure server wrt any feature we're relying on for this. E.g., it's not enough to know that our stub clears the metadata when overwriting a block. We need to know that the real Azure service is documented to behave that way, and actually does behave that way (e.g., via manual testing with real Azure containers).
All existing tests in volume_generic_test must pass, even when "Trash with trashLifetime>0" is implemented.
In the first set of testTrashEmptyTrashUntrash, with trashLifetime>0, after the "Get() returns notfound" check:
- Mtime → notfound
- Compare → notfound
- IndexTo does not include the trashed block
- Touch → notfound

Then, after the block is untrashed and Get() succeeds again:
- Mtime → ok
- Compare → ok
- IndexTo includes the block
- Touch → ok

And after the block is trashed again and the trash is emptied:
- Mtime → notfound
- Compare → notfound
- IndexTo does not include the trashed block
- Touch → notfound

(A sketch of the notfound group of checks follows.)
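A hedged sketch of those checks as a test helper, written against a stand-in volume interface — the method names follow volume_generic_test conventions, but the exact signatures are assumptions:

```go
package sketch

import (
	"bytes"
	"io"
	"os"
	"strings"
	"testing"
	"time"
)

// volume is a stand-in for the keepstore Volume interface; these method
// signatures are assumptions modeled on the code of that era.
type volume interface {
	Mtime(loc string) (time.Time, error)
	Compare(loc string, expect []byte) error
	Touch(loc string) error
	IndexTo(prefix string, w io.Writer) error
}

// expectNotFound asserts that a trashed block is invisible to every read
// path, not just Get(); the inverse checks apply after Untrash.
func expectNotFound(t *testing.T, v volume, hash string, block []byte) {
	if _, err := v.Mtime(hash); !os.IsNotExist(err) {
		t.Errorf("Mtime: want notfound, got %v", err)
	}
	if err := v.Compare(hash, block); !os.IsNotExist(err) {
		t.Errorf("Compare: want notfound, got %v", err)
	}
	if err := v.Touch(hash); !os.IsNotExist(err) {
		t.Errorf("Touch: want notfound, got %v", err)
	}
	var buf bytes.Buffer
	if err := v.IndexTo("", &buf); err != nil {
		t.Fatal(err)
	} else if strings.Contains(buf.String(), hash) {
		t.Errorf("IndexTo should not list trashed block %s", hash)
	}
}
```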
Future work (defer)
When doing a full index (prefix=""), build two trash lists: a list of blocks whose metadata indicates that they are trash, and a list of trash markers. After sending the index response and closing the connection, delete any trash markers whose block metadata indicates that they are not trash.
Updated by Brett Smith almost 9 years ago
- Target version set to Arvados Future Sprints
Updated by Peter Amstutz almost 9 years ago
Relevant APIs
"Lease Container" operation establishes and manages a lock on a container for delete operations.
"List Blobs" operation enumerates the list of blobs under the specified container. Can query by prefix. Blob metadata is returned by the list blobs operation with the "include=metadata" parameter.
"Delete Blob" operation marks the specified blob or snapshot for deletion. When a blob is successfully deleted, it is immediately removed from the storage account's index and is no longer accessible to clients. The blob's data is later removed from the service during garbage collection. It is an asynchronous operation (returns HTTP 202 Accepted)
"Copy Blob" operation copies a blob to a destination within the storage account. It is an asynchronous operation (returns HTTP 202 Accepted).
"Get Blob" reads or downloads a blob from the system. Blob metadata is returned in the headers.
Updated by Peter Amstutz almost 9 years ago
The biggest limiting factor is that "List Blobs" has extremely limited options for filtering (pretty much just by prefix). Filtering by metadata, for example, isn't supported at all.
Using "Copy Blob" incurs additional cost:
When you copy a source blob to a destination blob with a different name within the same account, you use additional storage resources for the new blob, so the copy operation results in a charge against the storage account’s capacity usage for those additional resources. However, if the source and destination blob name are the same within the same account (for example, when you promote a snapshot to its base blob), no additional charge is incurred other than the extra copy metadata stored in version 2012-02-12 and newer.
When you promote a snapshot to replace its base blob, the snapshot and base blob become identical. They share blocks or pages, so the copy operation does not result in an additional charge against the storage account's capacity usage. However, if you copy a snapshot to a destination blob with a different name, an additional charge is incurred for the storage resources used by the new blob that results. Two blobs with different names cannot share blocks or pages even if they are identical. For more information about snapshot cost scenarios, see Understanding How Snapshots Accrue Charges.
It's not clear what happens if you initiate a Copy Blob operation and then perform Delete Blob before the copy is complete (according to Stack Overflow, this is bad news: https://stackoverflow.com/questions/3734672/azure-storage-blob-rename)
We can easily update/get metadata on a per-block basis, just not filter Blob List on it.
Updated by Peter Amstutz almost 9 years ago
Tentative trash design:
- Initiate copy of blob XYZ to "trash/XYZ" and set metadata <ExpiresAt>123</ExpiresAt> on the trash target.
- Conditionally (not-modified-since) update metadata on the source blob to <ExpiresAt>123</ExpiresAt>
- List blobs with prefix "trash/"
- Conditionally (not-modified-since) try to delete the active blob that the trash blob was copied from
  - Unless the trashed blob has metadata <SourceDeleted>true</SourceDeleted>, so we don't keep doing this all the time.
  - Set metadata <SourceDeleted>true</SourceDeleted> on the trash blob after deleting the source.
- Actually delete blobs in "trash/" for which <ExpiresAt>123</ExpiresAt> has passed.
- When fetching a blob, treat it as "not found" if ExpiresAt metadata is present.
- Must ensure that any PUTs to the active blob store also reset metadata on the blob.
Rationale: the reason for this shuffle is to avoid having to enumerate the full list of active blobs, which could be hundreds of thousands of objects. Using the List Blobs API, we can fetch up to 5000 results per API request, and it requires a cursor to request subsequent results. For a million blobs, this means at least 200 API requests, which is likely to take a while.
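To make that cost concrete, a minimal sketch of the cursor loop — the list function and response shape are assumptions, modeled loosely on the SDK's ListBlobs:

```go
// listResponse stands in for the SDK's blob-list response; assumption only.
type listResponse struct {
	Blobs      []string
	NextMarker string
}

// forEachBlob pages through all blobs 5000 at a time using the continuation
// marker. For a million blobs this is ~200 round trips.
func forEachBlob(list func(marker string, maxResults int) (listResponse, error), visit func(name string)) error {
	marker := ""
	for {
		resp, err := list(marker, 5000)
		if err != nil {
			return err
		}
		for _, b := range resp.Blobs {
			visit(b)
		}
		if resp.NextMarker == "" {
			return nil // no more pages
		}
		marker = resp.NextMarker
	}
}
```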
Simple design, but requires enumerating the full blob list:
- Conditionally (not-modified-since) update metadata on the source blob to <ExpiresAt>123</ExpiresAt> (sketched below)
- List all blobs
- Conditionally (not-modified-since) delete blobs for which <ExpiresAt>123</ExpiresAt> has passed.
- When fetching a blob, treat it as "not found" if ExpiresAt metadata is present.
- Must ensure that any PUTs to the active blob store also reset metadata on the blob.
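The conditional ("not-modified-since") update maps onto the Set Blob Metadata REST operation, which honors If-Unmodified-Since and fails with 412 Precondition Failed on conflict. A minimal Go sketch using raw HTTP — authorization is omitted, and only the URL layout and header names are taken from the Azure REST docs:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// setMetadataIfUnmodifiedSince sets x-ms-meta-expires_at on a blob, but only
// if the blob has not been modified since `since`. The Authorization header
// is omitted here; real requests must be signed.
func setMetadataIfUnmodifiedSince(account, container, blob string, expiresAt int64, since time.Time) error {
	url := fmt.Sprintf("https://%s.blob.core.windows.net/%s/%s?comp=metadata",
		account, container, blob)
	req, err := http.NewRequest("PUT", url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("x-ms-version", "2015-02-21")
	req.Header.Set("x-ms-meta-expires_at", fmt.Sprintf("%d", expiresAt))
	req.Header.Set("If-Unmodified-Since", since.UTC().Format(http.TimeFormat))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusPreconditionFailed {
		return fmt.Errorf("blob modified since %v; not trashing", since)
	}
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("SetBlobMetadata: HTTP %d", resp.StatusCode)
	}
	return nil
}
```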
Updated by Tom Clegg almost 9 years ago
Could we...
- Trash → set <ExpiresAt>123</ExpiresAt> and create a marker (just an empty blob) named trash/0000123/hash
- Untrash → unset ExpiresAt and delete trash/0000123/hash
- EmptyTrash → index trash/* in order until the timestamps start being in the future. For each trash marker, get the modtime and ExpiresAt for the hash; if the data is still old and ExpiresAt is still set, delete it (otherwise do nothing); delete the trash marker in any case.
Doing a full index shouldn't be toooo scary, since we already need to do that on every data manager run...
Updated by Peter Amstutz almost 9 years ago
Tom Clegg wrote:
Could we...
- Trash → set <ExpiresAt>123</ExpiresAt> and create a marker (just an empty blob) named trash/0000123/hash
- Untrash → unset ExpiresAt and delete trash/0000123/hash
- EmptyTrash → index trash/* in order until the timestamps start being in the future. For each trash marker, get the modtime and ExpiresAt for the hash; if the data is still old and ExpiresAt is still set, delete it (otherwise do nothing); delete the trash marker in any case.
Doing a full index shouldn't be toooo scary, since we already need to do that on every data manager run...
That's a good point. In that case, the simple solution (use metadata) might be the way to go. We could make note of deletable blocks as part of index scanning.
Updated by Tom Clegg almost 9 years ago
- Description updated (diff)
- Category set to Keep
Updated by Brett Smith almost 9 years ago
- Target version changed from Arvados Future Sprints to 2016-03-30 sprint
Updated by Radhika Chippada almost 9 years ago
- Assigned To set to Radhika Chippada
Updated by Radhika Chippada almost 9 years ago
- Status changed from New to In Progress
Updated by Tom Clegg over 8 years ago
- Include trash in index responses? Datamanager will keep seeing the trash, and repeatedly tell keepstore to trash it; keepstore will keep trashing it, and as a side effect the block's last-modified date will be updated, which will prevent the blocks from getting old enough to delete -- unless we stop storing the "expires at" metadata (perhaps "trash" should avoid updating "expires at" anyway when it's already set?). Datamanager will also produce misleading statistics: if a lot of data is mistakenly sent to the trash, subsequent Datamanager runs will report reassuringly large amounts of live/untrashed data, which detracts somewhat from the benefit of having the trash phase. Keep-rsync will try to copy blocks that are in the trash, and error out (or we could fix it so it logs errors and continues).
- Store the trash flag in properties instead of metadata? Properties aren't user-defined: nowhere to put it.
- Get the blob metadata for each item in the list using SDKs that do exist? Too slow: each "index" would have to do a separate query for every single block in the system.
- Before generating index, get a list of trash markers, and omit those blocks from the index response? To avoid omitting blocks that have been written since being trashed, we would have to omit only blocks with last-trashed >= last-modified. So far our plan does not involve storing last-trashed, but we could add this: trash markers would be something like trash/000000000123/hash/000000000099, where 123 is the empty-trash deadline and 99 is the time the trash marker was created. This naming scheme would become redundant as soon as the SDK gets "list blobs with metadata" support.
- Add "list blobs with metadata" support to the SDK and submit a PR.
Updated by Radhika Chippada over 8 years ago
Regarding adding "list blobs with metadata" support to the SDK and submitting a PR:
- It appears (as Peter pointed out in IRC) that we can pass "include={metadata}" in the ListBlobs call to get the metadata headers.
- Then add these to the Blob struct by adding Metadata map[string]string `xml:"Metadata"`
- Add a test or two
Estimate: 1 story point
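One wrinkle with the proposed `xml:"Metadata"` tag: encoding/xml cannot unmarshal dynamic element names into a map[string]string directly, so the Metadata field likely needs a custom unmarshaler. A sketch — the type names are illustrative, not the actual SDK patch:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// BlobMetadata holds the <Metadata> element returned by List Blobs when
// include=metadata is requested. Metadata keys are dynamic XML element
// names, which encoding/xml cannot decode into a map automatically.
type BlobMetadata map[string]string

func (m *BlobMetadata) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error {
	*m = BlobMetadata{}
	for {
		tok, err := d.Token()
		if err != nil {
			return err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			var value string
			if err := d.DecodeElement(&value, &t); err != nil {
				return err
			}
			// Metadata names are case-insensitive; normalize to lowercase.
			(*m)[strings.ToLower(t.Name.Local)] = value
		case xml.EndElement:
			return nil // </Metadata>
		}
	}
}

type Blob struct {
	Name     string       `xml:"Name"`
	Metadata BlobMetadata `xml:"Metadata"`
}

func main() {
	data := `<Blob><Name>abc</Name><Metadata><expires_at>123</expires_at></Metadata></Blob>`
	var b Blob
	if err := xml.Unmarshal([]byte(data), &b); err != nil {
		panic(err)
	}
	fmt.Println(b.Metadata["expires_at"]) // prints 123
}
```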
Updated by Radhika Chippada over 8 years ago
When SetBlobMetadata is used to set the ExpiresAt metadata attribute during Trash, the Mtime on the blob is updated as well.
"Any write operation on the blob (including updates on the blob's metadata or properties) changes the last modified time of the blob." https://msdn.microsoft.com/en-us/library/azure/dd179414.aspx
Due to this, we will have failures if we try to invoke Trash on the same block more than once. The second call would find this block to be newer than blobSignatureTTL.
Ideally, we would need to set additional metadata attributes such as lastWrite and use this instead of Mtime to determine when a block should be considered trash. This would need updates to the Put func to set metadata after CreateBlockBlobFromReader.
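A minimal sketch of what Mtime would look like under that scheme — the last_write_at key and fallback behavior are assumptions, not the final code:

```go
package sketch

import (
	"strconv"
	"time"
)

// mtime prefers a last_write_at metadata timestamp (which Put/Touch would
// maintain) over the blob's Last-Modified property, since Last-Modified also
// moves whenever Trash rewrites the metadata.
func mtime(metadata map[string]string, lastModified time.Time) (time.Time, error) {
	if ts := metadata["last_write_at"]; ts != "" {
		unix, err := strconv.ParseInt(ts, 10, 64)
		if err != nil {
			return time.Time{}, err
		}
		return time.Unix(unix, 0), nil
	}
	// Blob written by an older keepstore: fall back to Last-Modified.
	return lastModified, nil
}
```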
Updated by Brett Smith over 8 years ago
- Status changed from In Progress to New
- Assigned To deleted (Radhika Chippada)
- Target version changed from 2016-03-30 sprint to Arvados Future Sprints
Updated by Tom Clegg over 8 years ago
- Target version changed from Arvados Future Sprints to 2016-05-11 sprint
Updated by Radhika Chippada over 8 years ago
- Assigned To set to Radhika Chippada
Updated by Radhika Chippada over 8 years ago
- Status changed from New to In Progress
Updated by Radhika Chippada over 8 years ago
- Trash/untrash are implemented in azure volume using the latest azure sdk
- As explained in note 20, a new last_write_at metadata attribute is used to determine when a blob is actually updated / touched.
Updated by Tom Clegg over 8 years ago
I'm not convinced this last_write_at strategy is safe. The fundamental issue seems to be the lack of an atomic "update metadata" operation. Consider the following race:
"empty trash" worker "put" request handler ----------------------------- ----------------------------- client requests PUT abc123 Touch(abc123) -> addToMetadata() notice abc123 is in the trash get blob metadata get blob metadata remove expires_at from local copy of metadata see expires_at as expected add last_write_at=now to local copy of metadata see last_write_at is old write new metadata with last_write_at=now delete blob success! data for abc123 is safe success! data for abc123 is gone
Is there anything preventing this? If not, any thoughts on how to fix it?
Meanwhile, a couple of notes on the code:
checkTrashed returns an error to indicate "yes, it's in the trash", which seems a bit weird. How about returning (bool, error) where true means it's in the trash, false means it's not in the trash, and a non-nil error means we don't know because we encountered an unexpected error? (The current code doesn't really care about the difference between "error" and "is in trash", but "is in trash" is generally an indication that something dangerous has happened -- if a client has a valid permission token for a block, it shouldn't be in the trash! -- so we should be logging/reporting this.)
v.translateError(NotFoundError) is redundant, since that just returns NotFoundError. But in any case, Get() is required to return something satisfying os.IsNotExist() (like os.ErrNotExist) if the block doesn't exist. IOW please fix the code so the IsNotExist tests pass, instead of weakening the tests with a strings.Contains() exemption.
The new checks in testTrashEmptyTrashUntrash seem like they could be in the checkGet() helper: if Get() fails, then Mtime and Compare must also fail with an IsNotExist() error; if Get() succeeds, then Mtime and Compare must also succeed.
In the Azure Trash() function we seem to have changed the sequence from "Etag, Mtime" to "Mtime, Etag", but left the comment in place that explains why the sequence is "Etag, Mtime". No fair! I think we need to change this back.
In EmptyTrash, the "noMoreOldMarkers" flag could be replaced by putting a label on the outer "for" loop:

```go
blobListPage:
	for {
		...
		for _, b := range resp.Blobs {
			...
			if ... {
				break blobListPage
			}
		}
	}
```
Updated by Radhika Chippada over 8 years ago
I'm not convinced this last_write_at strategy is safe ... && In the Azure Trash() function we seem to have changed the sequence from "Etag, Mtime" to "Mtime, Etag" ...
I introduced this problem by switching the Etag order. Corrected this
checkTrashed returns an error to indicate "yes, it's in the trash", which seems a bit weird. How about returning (bool, error) where true means it's in the trash, false means it's not in the trash, and a non-nil error means we don't know because we encountered an unexpected error?
This is much better and I updated accordingly
v.translateError(NotFoundError) is redundant, since that just returns NotFoundError. But in any case, Get() is required to return something satisfying os.IsNotExist() (like os.ErrNotExist) if the block doesn't exist. IOW please fix the code so the IsNotExist tests pass, instead of weakening the tests with a strings.Contains() exemption.
Made this update
The new checks in testTrashEmptyTrashUntrash seem like they could be in the checkGet() helper: if Get() fails, then Mtime and Compare must also fail with an IsNotExist() error; if Get() succeeds, then Mtime and Compare must also succeed.
Moved Mtime and Compare into the Get() helper.
In EmptyTrash, the "noMoreOldMarkers" flag could be replaced by putting a label on the outer "for" loop:
Added the label
Updated by Radhika Chippada over 8 years ago
Commit d5e4ac8e6
Check Etag while deleting a block during EmptyTrash routine.
Updated by Tom Clegg over 8 years ago
Mtime needs to return a correct result for a blob that was written to Azure storage by an earlier version of keepstore (i.e., a blob with no last_write_at metadata). Its comment should also be updated to match its new behavior.
We have multiple sources of truth here, and I think it would help to write down the rules about the significance of each state and how it's possible to arrive at each one.
| last_write_at | expires_at | trash marker | result | how can it happen? |
|---------------|------------|--------------|--------|--------------------|
| none | none | none | do not delete | block was written by old version. |
| none | none | yes, 20s ago | do not delete | Trash() lost an addToMetadata race, then succeeded in adding a trash marker |
| none | 20s ago | yes, 20s ago | delete | block was written by old version, and is trash. |
| 4w ago | 20s ago | yes, 20s ago | delete | block was written by new version, and is trash. |
| none/anything | 20s ago | none | block cannot be retrieved, and is not included in index, but will never be deleted | do we guarantee this can never happen? |
| 30s ago | 20s ago | yes, 20s ago | block was written recently, but will be deleted (!!) | do we guarantee this can never happen? |
| 30s ago | none | yes, 20s ago | trash marker will be deleted, data will not be deleted | block was written, trashed, and then rewritten |
addToMetadata and removeFromMetadata are still prone to races:
"process trash list" goroutine "PUT request" handler -------------------------------------------- -------------------------------------------- get metadata get metadata metadata has last_write_at=(old) set metadata[expires_at]=now+TTL set metadata[last_write_at]=now (leave metadata[last_write_at] alone) write metadata to Azure write metadata to Azure success! last_write_at is fresh, so block is safe. success! last_write_at is old, so block is trashed / scheduled to be deleted.
In EmptyTrash, if GetBlobProperties() fails because the data blob does not exist, we should delete the trash marker. Otherwise, we'll keep retrying (and logging failures) indefinitely, which seems pointless.
- checkTrashed still seems weird. It says "Return NotFoundError if trash marker is found on the block". Shouldn't it return true, nil in this case? That seems to be the way its return values are interpreted by callers. Also:
  - Elsewhere, "trash marker" means the placeholder blob called "trash.exp.hash". This function claims to check the trash marker, but it actually checks the metadata entry.
  - v.translateError(NotFoundError) should be just NotFoundError. Generally, (*AzureBlobVolume)translateError() translates Azure SDK errors: we shouldn't be passing it other errors, like the ones returned by strconv.ParseInt().
  - if trashed == true should be just if trashed
- It seems like nearly every function that uses checkTrashed() already does some other Azure API call that returns metadata, which seems wasteful. E.g., in Mtime:
  - checkTrashed → GetBlobMetadata
  - addToMetadata → GetBlobMetadata; SetBlobMetadata
- go vet shows some missing arguments for Printf()
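Returning to the metadata race above: the standard cure is optimistic concurrency on the blob's ETag — read metadata and ETag together, write back with If-Match, and retry on 412 Precondition Failed. A hedged sketch of that compare-and-swap loop; the client interface is an assumption, anticipating the If-Match approach adopted later in this thread:

```go
package sketch

import "errors"

// errPreconditionFailed stands in for the SDK's HTTP 412 error.
var errPreconditionFailed = errors.New("precondition failed (ETag mismatch)")

type metadataClient interface {
	GetBlobMetadata(container, name string) (metadata map[string]string, etag string, err error)
	// SetBlobMetadataIfMatch sends "If-Match: etag" and returns
	// errPreconditionFailed if another writer got there first.
	SetBlobMetadataIfMatch(container, name string, metadata map[string]string, etag string) error
}

// updateMetadataCAS retries a read-modify-write of blob metadata until the
// ETag-guarded write succeeds, closing the race described above.
func updateMetadataCAS(c metadataClient, container, name string, mutate func(map[string]string)) error {
	for attempt := 0; attempt < 3; attempt++ {
		metadata, etag, err := c.GetBlobMetadata(container, name)
		if err != nil {
			return err
		}
		mutate(metadata)
		switch err := c.SetBlobMetadataIfMatch(container, name, metadata, etag); {
		case err == nil:
			return nil
		case errors.Is(err, errPreconditionFailed):
			continue // somebody else wrote in between; re-read and retry
		default:
			return err
		}
	}
	return errors.New("too many concurrent metadata writers")
}
```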
Updated by Brett Smith over 8 years ago
- Target version changed from 2016-05-11 sprint to 2016-05-25 sprint
Updated by Radhika Chippada over 8 years ago
A lot changed in 22db012b
- Enhanced the Azure SDK to support sending extra headers in SetBlobMetadata requests, so that we do not have race conditions between two different threads doing trashing and putting.
- Got rid of the trash marker and the last-write-at extra metadata attribute. We now go through all the blobs in storage during the empty trash routine.
Addressed the comments in note 30:
Mtime needs to return a correct result for a blob that was written to Azure storage by an earlier version of keepstore (i.e., a blob with no last_write_at metadata). Mtime comment should also be updated to match its new behavior.
Reverted to the previous implementation of Mtime (after checking if trashed), since we are no longer using last-write-at; hence this issue is no longer applicable.
We have multiple sources of truth here, and I think it would help to write down the rules about the significance of each state and how it's possible to arrive at each one.
Since we no longer use the trash marker or last_write_at, and we now include If-Match during SetBlobMetadata, we do not have these problems now.
addToMetadata and removeFromMetadata are still prone to races:
Addressed with the SDK update to include If-Match during SetBlobMetadata
In EmptyTrash, if GetBlobProperties() fails because the data blob does not exist, we should delete the trash marker. Otherwise, we'll keep retrying (and logging failures) indefinitely, which seems pointless.
Got rid of the trash marker altogether per our discussion.
checkTrashed still seems weird. It says "Return NotFoundError if trash marker is found on the block". Shouldn't it return true, nil in this case? That seems to be the way its return values are interpreted by callers.
Updated it
v.translateError(NotFoundError) should be just NotFoundError. Generally, (*AzureBlobVolume)translateError() translates Azure SDK errors: we shouldn't be passing it other errors, like the ones returned by strconv.ParseInt().
Updated it
if trashed == true should be just if trashed
Done
It seems like nearly every function that uses checkTrashed() already does some other Azure API call that returns metadata, which seems wasteful. E.g., in Mtime:
Got rid of the addTo and removeFrom metadata methods and simplified to reuse metadata from the first request as much as possible
go vet shows some missing arguments for Printf()
Corrected these
Updated by Tom Clegg over 8 years ago
This looks better, thanks.
The purpose of the "touch" metadata is to ensure the Last-Modified timestamp really gets updated in Touch(), even if the storage service has "don't update Last-Modified if an update does not result in any modifications" logic. It isn't meaningful, and it's the only metadata that could be overwritten during Trash() (right?), so Trash() could skip the GetBlobMetadata call and just set metadata to map[string]string{"expires_at": fmt.Sprintf(...)}
- We should count blocksInTrash++ only for blocks that are trash, not all blocks.
- We should not need to call GetBlobProperties or GetBlobMetadata to learn whether a block is trash. Instead, we should use Include: "metadata" in ListBlobsParameters, and just rely on the metadata and properties given in the list response.
- We should fix our stub server so (like the real server) it doesn't return metadata in a ListBlobs response unless asked. That (plus better testing, see below) should reveal that we forgot to use Include: "metadata" in IndexTo, too, which means we will never get any metadata and we will erroneously include trash in the index response.
Why add trashLifetime here? Shouldn't we delete it as soon as it expires? (Also, this should be phrased "if not expired { continue }" like the conditions before it.)

```go
if expiresAt <= time.Now().Add(-1*trashLifetime).Unix()
```
testDeleteOldBlock looks like it's missing a || !os.IsNotExist(err):

```go
...
_, err := v.Mtime(TestHash)
if err == nil {
	t.Fatalf("os.IsNotExist(%v) should have been true", err)
}
```
It looks like we aren't testing that a trashed (but not yet deleted) block is omitted by IndexTo. We could do this in testTrashEmptyTrashUntrash's checkGet func, something like this:
```go
n, err := v.Get(TestHash, buf)
if err != nil {
	if err2 := v.Touch(TestHash); err2 == nil || !os.IsNotExist(err2) {
		t.Fatalf("Get returned %q but Touch returned %q", err, err2)
	}
	// (try Mtime and IndexTo)
	return err
}
```
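Pulling these points together, a hedged sketch of the suggested EmptyTrash scan — listing with metadata included, skipping non-trash and unexpired entries, and guarding deletion with If-Match so a block rewritten after listing is left alone. All types and method shapes are assumptions modeled on the SDK changes discussed in this thread:

```go
package sketch

import (
	"log"
	"strconv"
	"time"
)

// Minimal stand-ins for the SDK types used below; assumptions only.
type ListBlobsParameters struct {
	Include    string
	MaxResults int
	Marker     string
}

type Blob struct {
	Name       string
	Metadata   map[string]string
	Properties struct{ Etag string }
}

type BlobListResponse struct {
	Blobs      []Blob
	NextMarker string
}

type blobLister interface {
	ListBlobs(container string, params ListBlobsParameters) (BlobListResponse, error)
	DeleteBlob(container, name string, extraHeaders map[string]string) error
}

// emptyTrash scans all blobs with metadata included and deletes expired trash.
func emptyTrash(c blobLister, container string) {
	params := ListBlobsParameters{Include: "metadata", MaxResults: 5000}
	for {
		resp, err := c.ListBlobs(container, params)
		if err != nil {
			log.Printf("EmptyTrash: ListBlobs: %v", err)
			return
		}
		for _, b := range resp.Blobs {
			if b.Metadata["expires_at"] == "" {
				continue // not trash
			}
			expiresAt, err := strconv.ParseInt(b.Metadata["expires_at"], 10, 64)
			if err != nil {
				log.Printf("EmptyTrash: ParseInt(%q): %v", b.Metadata["expires_at"], err)
				continue
			}
			if expiresAt > time.Now().Unix() {
				continue // not expired yet
			}
			err = c.DeleteBlob(container, b.Name, map[string]string{
				"If-Match": b.Properties.Etag, // skip if rewritten since listing
			})
			if err != nil {
				log.Printf("EmptyTrash: DeleteBlob(%v): %v", b.Name, err)
			}
		}
		if resp.NextMarker == "" {
			return
		}
		params.Marker = resp.NextMarker
	}
}
```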
Updated by Radhika Chippada over 8 years ago
... Trash() could skip the GetBlobMetadata call, and just set metadata to map[string]string{"expires_at": fmt.Sprintf(...)}
Skipped the GetBlobMetadata (which I earlier added for last-write-at)
In EmptyTrash() - We should count blocksInTrash++ only for blocks that are trash, not all blocks.
Corrected this
In EmptyTrash() - We should not need to call GetBlobProperties or GetBlobMetadata to learn whether a block is trash. Instead, we should use Include: "metadata" in ListBlobsParameters, and just rely on the metadata and properties given in the list response.
Oooops. The whole point of updating the Azure lib was for this!! Corrected this
We should fix our stub server so (like the real server) it doesn't return metadata in a ListBlobs response unless asked. That (plus better testing, see below) should reveal that we forgot to use Include: "metadata" in IndexTo, too, which means we will never get any metadata and we will erroneously include trash in the index response.
Updated our stub to honor Include parameter
Why add trashLifetime here? Shouldn't we delete it as soon as it expires? (Also, this should be phrased "if not expired { continue }" like the conditions before it.)
Updated this
testDeleteOldBlock looks like it's missing a || !os.IsNotExist(err) ...
The translateError in S3Volume is not checking for "Not Found" in the error, and the test fails for S3 volumes. I can update translateError for S3Volume to do more, if acceptable.
It looks like we aren't testing that a trashed (but not yet deleted) block is omitted by IndexTo.
The checkGet func is also doing IndexTo (line 803).
Updated by Tom Clegg over 8 years ago
bytesDeleted and bytesInTrash are reported, but aren't actually being counted.
This should be phrased "if expiresAt > time.Now().Unix() { continue }" like the conditions before it, instead of indenting the remainder of the loop:

```diff
+	if expiresAt <= time.Now().Unix() {
+		err = v.bsClient.DeleteBlob(v.containerName, b.Name, map[string]string{
+			"If-Match": b.Properties.Etag,
+		})
+		if err != nil {
+			log.Printf("EmptyTrash: DeleteBlob(%v): %v", b.Name, err)
+			continue
+		}
+		blocksDeleted++
+	}
```
In the stub server code, in case r.Method == "PUT" && r.Form.Get("comp") == "metadata":, the statement blob.Metadata = make(map[string]string) should not be removed. The Azure server replaces the entire metadata map rather than merging the new values with existing values (right?), so the stub should do the same.
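For illustration, the relevant stub case might look like this — the switch arm is quoted from the review, while the header-parsing details are assumptions:

```go
case r.Method == "PUT" && r.Form.Get("comp") == "metadata":
	// Replace, don't merge: the real Azure service discards any existing
	// metadata keys that are not present in the Set Blob Metadata request.
	blob.Metadata = make(map[string]string)
	for k, v := range r.Header {
		if strings.HasPrefix(strings.ToLower(k), "x-ms-meta-") {
			blob.Metadata[strings.TrimPrefix(strings.ToLower(k), "x-ms-meta-")] = v[0]
		}
	}
```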
Comment on checkTrashed should be updated (says "Return NotFoundError if..." should be "Return true if...")
This seems unnecessarily verbose; suggest:

```diff
-	expiresAtMetadata := b.Metadata["expires_at"]
-	if expiresAtMetadata == "" {
+	if b.Metadata["expires_at"] == "" {
 		continue
 	}
```
I think we should fix the S3 error, if it's as easy as it sounds.
Updated by Radhika Chippada over 8 years ago
bytesDeleted and bytesInTrash are reported, but aren't actually being counted.
Corrected this
This should be phrased "if expiresAt > time.Now().Unix() { continue }"
Updated
In the stub server code, in case r.Method == "PUT" && r.Form.Get("comp") == "metadata":, the statement blob.Metadata = make(map[string]string) should not be removed.
Updated
Comment on checkTrashed should be updated (says "Return NotFoundError if..." should be "Return true if...")
Updated
This seems unnecessarily verbose, suggest
Updated
I think we should fix the S3 error, if it's as easy as it sounds.
Updated this and updated the testDeleteOldBlock test.
Updated by Tom Clegg over 8 years ago
LGTM, thanks. This branch contains a fair bit of reverted changes, so I suggest squashing it into one commit. The easiest way seems to be:

```
cd path_to_arvados
git fetch
git checkout -b 8556-azure-trash origin/master
git diff origin/master...8556-trash-untrash-azure-volume | patch -p1
git commit
```
This should give you a new branch called 8556-azure-trash with all the same changes as the original branch, but all in one commit, and based on current master.
Updated by Radhika Chippada over 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:aa1d6f9c5e1e21ceedf855bdb2f6db9154f26669.