Feature #7159

Updated by Tom Clegg over 8 years ago

h2. Functional requirements 

 * You can run an Arvados cluster where all Keep blocks are stored as Azure blobs. 
 * Keepstore accepts PUT requests and saves the block as an Azure blob. The response includes an @X-Keep-Replicas-Stored@ header reporting the redundancy level of the stored blob (a response sketch follows this list). 
 ** Ideally this would be introspected from the storage account. If that's too difficult, it's okay to let the administrator set the redundancy level themselves. If that, too, is impractical, it's okay to hardcode 3 as the value, since that's the lowest redundancy level of any Azure blob. 
 * Keepstore accepts and serves GET requests for blocks that are stored as Azure blobs. 
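
For illustration, here is a minimal sketch of the PUT response contract. The handler name, the @Volume@ interface shape, and the @Replication()@ method are all hypothetical stand-ins, not keepstore's actual code:

<pre><code class="go">
package keepsketch

import (
	"io"
	"net/http"
	"strconv"
)

// Volume is a minimal stand-in for keepstore's volume interface;
// a Replication() method reporting the redundancy level is an
// assumption (see the introspect-vs-configure question above).
type Volume interface {
	Put(loc string, data []byte) error
	Replication() int // e.g. 3 for Azure locally redundant storage
}

// handlePut stores the block, then tells the client how many
// replicas the backend keeps via X-Keep-Replicas-Stored.
func handlePut(vol Volume, w http.ResponseWriter, req *http.Request) {
	loc := req.URL.Path[1:] // block checksum, e.g. "acbd1234..."
	data, err := io.ReadAll(req.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	if err := vol.Put(loc, data); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Header().Set("X-Keep-Replicas-Stored", strconv.Itoa(vol.Replication()))
	w.WriteHeader(http.StatusOK)
}
</code></pre>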

h2. Implementation

Write an Azure blob storage volume in keepstore.
* Add an @AzureBlobVolume@ type in @azure_blob_volume.go@.
* (Ideally) write at least some of the volume tests such that the same set of tests can run against all volume types (in addition to type-specific edge cases). Doing this for "backdate-and-touch" tests might require a @TestableAzureBlobVolume@ wrapper that has a "backdate locator timestamp to X" function. This should help avoid holes in our test coverage ("forgot to test condition X in volume type Y") and make it much faster to add new volume types.
* Extend @(*volumeSet)Set()@ to accept an argument like "azure-blob:XYZ", where XYZ is a container name.
* Add an @-azure-storage-connection-string@ flag that accepts a string argument and works like flagReadonly: i.e., it applies to all subsequent @-volume azure-blob:XYZ@ arguments. If the argument starts with "/" or ".", use the first line of the named file; otherwise use the literal argument. (A sketch of this parsing appears after this list.)
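
A minimal sketch of the proposed parsing, assuming the semantics above; apart from @volumeSet@, @Set()@, and the flag strings themselves, every name here is invented for illustration:

<pre><code class="go">
package keepsketch

import (
	"fmt"
	"os"
	"strings"
)

// azureStorageConnectionString holds the most recent
// -azure-storage-connection-string argument; like flagReadonly, it
// applies to all subsequent -volume azure-blob:XYZ arguments.
var azureStorageConnectionString string

// setConnectionString implements the proposed flag semantics: an
// argument starting with "/" or "." names a file whose first line is
// the connection string; anything else is the literal string.
func setConnectionString(arg string) error {
	if strings.HasPrefix(arg, "/") || strings.HasPrefix(arg, ".") {
		buf, err := os.ReadFile(arg)
		if err != nil {
			return err
		}
		arg = strings.SplitN(string(buf), "\n", 2)[0]
	}
	azureStorageConnectionString = arg
	return nil
}

// AzureBlobVolume is a skeleton of the type proposed for
// azure_blob_volume.go; the fields are illustrative only.
type AzureBlobVolume struct {
	ContainerName    string
	ConnectionString string
	ReadOnly         bool
}

type volumeSet []*AzureBlobVolume // the real volumeSet holds all volume types

// Set handles one -volume argument. Only the proposed
// "azure-blob:XYZ" form is sketched here.
func (vs *volumeSet) Set(arg string) error {
	if container, ok := strings.CutPrefix(arg, "azure-blob:"); ok {
		if azureStorageConnectionString == "" {
			return fmt.Errorf("-azure-storage-connection-string must precede %q", arg)
		}
		*vs = append(*vs, &AzureBlobVolume{
			ContainerName:    container,
			ConnectionString: azureStorageConnectionString,
		})
		return nil
	}
	return fmt.Errorf("unsupported volume %q in this sketch", arg)
}
</code></pre>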

It should be possible to run keepstore with both Azure and local storage volumes enabled. (This might only be useful when one or the other is configured as read-only.)
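
For example, a mixed Azure-plus-local invocation might look like the following. Flag names are per the proposal above; the connection-string file path and container name are made up, and @-readonly@ is assumed to follow the existing flagReadonly behavior of applying to subsequent volumes:

<pre>
keepstore \
  -azure-storage-connection-string=/etc/keepstore/azure.conn \
  -volume azure-blob:examplecontainer \
  -readonly \
  -volume /mnt/local-keep
</pre>

Here the local volume is read-only while the Azure container accepts writes.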

h2. Outstanding issues to investigate

* We're assuming we can save and retrieve blobs using their checksum as their name. Are there any obstacles to this?
** Seems fine according to MS docs. "acb/acbd1234..." is also an option.
* Are there any limits on the number of blobs that can be stored in a container? If so, keepstore needs to be able to find blocks across multiple containers, and may need the capability to create containers if the limit is low enough, or if we can't find a good predetermined division of containers.
** Seems fine. "An account can contain an unlimited number of containers. A container can store an unlimited number of blobs."
* Are there performance characteristics like "container gets slow if you don't use some sort of namespacing", as with ext4? I.e., should we name blobs "acb/acbd1234..." like we do in @UnixVolume@, or just "acbd1234..."?
** @listBlobsSegmentedWithPrefix@ seems to do exactly what @IndexTo@ needs, which is handy (see the sketch after this list).
* How will we store "time of most recent PUT" timestamps? @setBlobProperties@ seems relevant, but will "index" be unusably slow if we have to call @getBlobProperties@ once per blob? (The sketch after this list assumes the listing response itself carries the needed metadata.)
* How will we resolve race conditions like "Data Manager deletes an old unreferenced block at the same time a client PUTs a new copy of it"? Currently we rely on @flock()@. "Lease" seems to be the relevant Azure feature.
* Is "write a blob" guaranteed to be atomic (i.e., never leaves a partial file visible), or do we still need the "write and rename into place" approach we use in @UnixVolume@?

h2. Refs

* "How to use Blob storage":https://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-blobs/ (last updated 08/04/2015)
