Project

General

Profile

Actions

Feature #7159

closed

[Keep] Implement an Azure blob storage volume in keepstore

Added by Brett Smith over 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
Story points:
0.5

Description

Functional requirements

  • You can run an Arvados cluster where all Keep blocks are stored as Azure blobs.
  • Keepstore accepts PUT requests and saves the block as an Azure blob. The response includes an X-Keep-Replicas-Stored header that returns the redundancy level of the blob.
    • Ideally this would be introspected from the storage account. If that's too difficult, it's okay to let the administrator set the redundancy level themselves. If that's too difficult, it's okay to hardcode 3 as the value, since that's the lowest redundancy level of any Azure blob.
  • Keepstore accepts and serves GET requests for blocks that are stored as Azure blobs.

Implementation

Write an Azure blob storage volume in keepstore.
  • Add an AzureBlobVolume type in azure_blob_volume.go.
  • Extend (*volumeSet)Set() to accept an argument like "azure-blob:XYZ" where XYZ is a container name.
  • Add an -azure-storage-connection-string flag that accepts a string argument and works like flagReadonly: i.e., it applies to all subsequent -volume azure-blob:XYZ arguments. If the argument starts with "/" or ".", use the first line of the given file, otherwise use the literal argument.

It should be possible to run keepstore with both azure and local storage devices enabled. (This might only be useful when one or the other is configured as read-only.)

Outstanding issues to investigate

  • We're assuming we can save and retrieve blobs using their checksum as their name. Are there any obstacles to this?
    • Seems fine according to MS docs. "acb/acbd1234..." is also an option.
  • Are there any limitations to the number of blobs that can be stored in a bucket? If so, keepstore needs to be able to find blocks across multiple buckets, and may need to have the capability to create buckets if the limit is low enough or we can't find a good predetermined division of buckets.
    • Seems fine. "An account can contain an unlimited number of containers. A container can store an unlimited number of blobs."
  • Are there performance characteristics like "container gets slow if you don't use some sort of namespacing", like ext4? I.e., should we name blobs "acb/acbd1234..." like we do in UnixVolume, or just "acbd1234..."?
    • listBlobsSegmentedWithPrefix seems to do exactly what IndexTo needs, which is handy.
  • How will we store "time of most recent PUT" timestamps? "setBlobProperties" seems relevant, but is "index" going to be unusably slow if we have to call getBlobProperties once per blob?
  • How will we resolve race conditions like "data manager deletes an old unreferenced block at the same time a client PUTs a new copy of it"? Currently we rely on flock(). "Lease" seems to be the relevant Azure feature.
  • Is "write a blob" guaranteed to be atomic (and never write a partial file) or do we still need the "write and rename into place" approach we use in UnixVolume?
Refs

Subtasks 6 (0 open6 closed)

Task #7484: Deal with race in CreateBlobResolvedTom Clegg08/28/2015Actions
Task #7501: Adjust Azure mtimes to local timeClosedTom Clegg08/28/2015Actions
Task #7500: Address golint complaintsResolvedTom Clegg10/09/2015Actions
Task #7416: Review 7159-empty-blob-raceResolvedPeter Amstutz08/28/2015Actions
Task #7417: Document how all the issues raised on 7159 were addressedResolvedTom Clegg08/28/2015Actions
Task #7536: Review 7159-clean-index (also fixes #7168)ResolvedPeter Amstutz08/28/2015Actions

Related issues 6 (0 open6 closed)

Related to Arvados - Bug #7161: [SDKs] PySDK supports any Keep service type, using proxy replication logic for non-disk typesResolvedRadhika Chippada09/23/2015Actions
Related to Arvados - Bug #7162: [SDKs] GoSDK supports any Keep service type, using proxy replication logic for non-disk typesResolvedRadhika Chippada08/28/2015Actions
Related to Arvados - Bug #7167: [Deployment] Write an efficient Keep migration scriptResolvedRadhika Chippada09/30/2015Actions
Related to Arvados - Idea #7179: [Keep] One set of keepstore volume tests should test all volume typesResolvedRadhika Chippada09/02/2015Actions
Blocks Arvados - Bug #7160: [Documentation] Document how to deploy Keep backed by Azure blob storageResolvedTom Clegg08/28/2015Actions
Blocked by Arvados - Idea #7241: [Keep] Prototype Azure blob storageResolvedTom Clegg09/23/2015Actions
Actions #1

Updated by Brett Smith over 9 years ago

Actions #2

Updated by Tom Clegg over 9 years ago

  • Description updated (diff)
Actions #3

Updated by Tom Clegg over 9 years ago

  • Description updated (diff)
Actions #4

Updated by Brett Smith over 9 years ago

  • Target version changed from Arvados Future Sprints to 2015-09-30 sprint
Actions #5

Updated by Peter Amstutz over 9 years ago

Azure storage computes content MD5 and provides it base64 encoded (but I don't think you can query by it)

Actions #6

Updated by Brett Smith over 9 years ago

  • Target version deleted (2015-09-30 sprint)
Actions #7

Updated by Brett Smith over 9 years ago

  • Target version set to Arvados Future Sprints
Actions #8

Updated by Brett Smith over 9 years ago

  • Target version changed from Arvados Future Sprints to 2015-10-14 sprint
Actions #9

Updated by Brett Smith over 9 years ago

  • Assigned To set to Tom Clegg
  • Story points set to 0.5

The actual implementation was effectively done in the prototype story #7241, because that went very well. For this sprint, please go over the bullet points, make sure everything is addressed, and make notes as appropriate as to how those issues were addressed. It will likely come in handy when we support additional object stores in the future.

Actions #10

Updated by Peter Amstutz over 9 years ago

  • How will we store "time of most recent PUT" timestamps? "setBlobProperties" seems relevant, but is "index" going to be unusably slow if we have to call getBlobProperties once per blob?
    • The Azure index returns last modified time in the blob properties, so additional API calls are not needed
  • How will we resolve race conditions like "data manager deletes an old unreferenced block at the same time a client PUTs a new copy of it"? Currently we rely on flock(). "Lease" seems to be the relevant Azure feature.
    • Conditional delete using the etag of the most recent version, if the block is written either it will update the etag (so the conditional delete fails) or the delete will happen first (and the block is rewritten)
  • Is "write a blob" guaranteed to be atomic (and never write a partial file) or do we still need the "write and rename into place" approach we use in UnixVolume?
    • The "commit block list" is atomic, we're using a PUT of 64 MiB which is presumably using commit block list underneath.
    • However, I'm not 100% confident that there isn't a race condition in which a zero-length blob is visible for a tiny slice of time.
Actions #11

Updated by Tom Clegg about 9 years ago

  • Status changed from New to In Progress
Actions #12

Updated by Tom Clegg about 9 years ago

Is "write a blob" guaranteed to be atomic (and never write a partial file) or do we still need the "write and rename into place" approach we use in UnixVolume?

Testing with 0.1.20151002220939.f81f84e confirms CreateBlob() is not atomic: while goroutine A is waiting for CreateBlob("foo", "bar") to return, if goroutine B calls GetBlob("foo"), it might get:
  • Error: block does not exist
  • "" (oops, CreateBlob is not atomic)
  • "bar"

So far, no evidence that Commit has a race: i.e., we don't see partial data like "b" or "ba".

Workaround in 7159-empty-blob-race: When getting a blob, if it's empty, pause 1s and retry until it is 15s old before returning it.

Note: 1 second is a long time if the block is small, but this may be fine. The obvious reason why this race would be common is that clients launch two writer threads at once to write the same block, and a single container might be shared by multiple keepstore servers. With this workaround, one thread will run fast and the other will wait 1 second. The one that runs fast will tell the client it has written 3 copies, and the client will abandon the thread that's waiting 1 second.

Actions #13

Updated by Peter Amstutz about 9 years ago

Reviewing 7159-empty-blob-race:

  • Because the current configuration limits the entire transfer time to 20 seconds, a 15 second window to wait for a block to show up will only leave 5 seconds to transfer the entire block. That seems like it could be a problem.
  • Do you have a sense for how long the race window is? Does the empty blob appear at the beginning of the transfer?
Actions #14

Updated by Tom Clegg about 9 years ago

Peter Amstutz wrote:

Reviewing 7159-empty-blob-race:

  • Because the current configuration limits the entire transfer time to 20 seconds, a 15 second window to wait for a block to show up will only leave 5 seconds to transfer the entire block. That seems like it could be a problem.

True. I don't think there's any way around this, other than "keep services should come with recommended timeout". Until then, fwiw 15 seconds is the maximum time we're willing to wait, but the actual time we wait is (at most) the actual race window plus 1 second granularity.

  • Do you have a sense for how long the race window is? Does the empty blob appear at the beginning of the transfer?

Not really. Probably a good idea to add logging here so we can find out...

Actions #15

Updated by Tom Clegg about 9 years ago

Resolved issues

Are there performance characteristics like "container gets slow if you don't use some sort of namespacing", like ext4? I.e., should we name blobs "acb/acbd1234..." like we do in UnixVolume, or just "acbd1234..."?

Blocks are named just "acbd18db4cc2f85cedef654fccc4a4d8", with no prefix like "acb/acbd18...". Judging by the API and the docs, there is nothing special about the "/" character so we're assuming there's no performance benefit to "acb/acbd18..." like there is in POSIX filesystems.

The Azure API has a "list blobs with given prefix" which allows us to implement our own "index with prefix" efficiently without a directory-like structure.

number of blobs that can be stored in a bucket container

Docs say: "An account can contain an unlimited number of containers. A container can store an unlimited number of blobs."

is "index" going to be unusably slow if we have to call getBlobProperties once per blob?

The ListBlobs API response already includes the Properties for each blob, so no, we don't need an extra API call per listed blob.

Is "write a blob" guaranteed to be atomic (and never write a partial file) or do we still need the "write and rename into place" approach we use in UnixVolume?

There is an atomic commit operation, but we haven't found an atomic create+commit. The create+commit API seems to consist of an atomic create followed by an atomic commit.
  • It's common to see an empty blob while a create+commit API call is in progress.
  • If create+commit fails, the blob gets deleted.

The Azure documentation claims the API obeys "If-Unmodified-Since" request headers, but the API doesn't appear to do so. See "if false" part of test case: https://github.com/curoverse/azure-sdk-for-go/blob/886dc43df1112ee01f61f75dc4211fb898c04339/storage/blob_test.go#L324-L335 Our workaround uses "If-Match" (which really is obeyed by Azure) during Delete, relying on Touch to change Etag by storing the current time in blob metadata.

Outstanding issues

Race

Confirm that data cannot be lost in the following race:
  • A sees no existing data, starts a create+commit
  • B sees no existing data, starts a create+commit
  • A's create+commit succeeds, A returns success to caller
  • B's create+commit fails between create and commit (e.g., network error in transit)

In this kind of race, if A has already had a chance to return success to caller, B must not truncate or delete the blob. Currently we rely on Azure not to hit "delete" to clean up B's transaction (as it normally would if upload fails) if someone else has committed data to the blob in the meantime -- regardless of whether it was A or B that succeeded in the "create" call.

Timeouts

If two writers send the same data block to a container (through the same keepstore services or not) one of them will win the race and just do a "create+commit" operation. The loser will wait for the winner to finish committing, then read back the block to make sure the block data (not just the hash) is equal. This means the losing client waits for two transfers (write + read), and is therefore more likely to reach its (default 20 second) timeout.

Conditional delete

We use a fork of the official Azure Go SDK because the official version doesn't support conditional delete using the "If-Match" header. There is an open issue tagged "help wanted". Once someone (us?) adds support to the official SDK and we upgrade to Go1.5, we should consider using Go1.5's vendor experiment instead of importing from our own github fork.

Actions #16

Updated by Tom Clegg about 9 years ago

  • Tracker changed from Bug to Feature

Here are some performance numbers obtained with keepexercise @ c8b37d5. Note: testing was not especially rigorous.

Platform Volume Runtime Reading threads Writing threads #different blocks Net read MiB/s Net write MiB/s
AWS xfs 1m30s 16 4 1 20.6 6.4
AWS xfs 1m30s 16 4 4 18.5 7.8
AWS xfs 2m 4 16 16 6.9 9.6
Azure xfs 10m 16 4 1 92.2 46.5
Azure xfs 2m 16 4 4 92.5 52.1
Azure xfs 5m 4 16 16 43.9 55.5
Azure Blob 7m 16 4 1 68.1 15.8
Azure Blob 3m 16 4 4 78.9 23.1
Azure Blob 3m 4 16 16 22.8 51.2
Actions #17

Updated by Peter Amstutz about 9 years ago

7159-clean-index LGTM

Actions #18

Updated by Peter Amstutz about 9 years ago

Tom Clegg wrote:

Here are some performance numbers obtained with keepexercise @ c8b37d5. Note: testing was not especially rigorous.

Platform Volume Runtime Reading threads Writing threads #different blocks Net read MiB/s Net write MiB/s
AWS xfs 1m30s 16 4 1 20.6 6.4
AWS xfs 1m30s 16 4 4 18.5 7.8
AWS xfs 2m 4 16 16 6.9 9.6
Azure xfs 10m 16 4 1 92.2 46.5
Azure xfs 2m 16 4 4 92.5 52.1
Azure xfs 5m 4 16 16 43.9 55.5
Azure Blob 7m 16 4 1 68.1 15.8
Azure Blob 3m 16 4 4 78.9 23.1
Azure Blob 3m 4 16 16 22.8 51.2

I'm confused. How is it that AWS has the shortest Runtime but also the lowest transfer rate?

Actions #19

Updated by Tom Clegg about 9 years ago

Peter Amstutz wrote:

I'm confused. How is it that AWS has the shortest Runtime but also the lowest transfer rate?

The test stopped when I happened to notice the transfer rate was more or less stable (or just decided the test wasn't getting any more interesting) and pressed ^C -- there wasn't a "total bytes transferred" target or anything like that.

Actions #20

Updated by Tom Clegg about 9 years ago

  • Description updated (diff)
Actions #21

Updated by Tom Clegg about 9 years ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF