Feature #10990

[keep-web] Add support for HTTP Range requests

Added by Tom Morris almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date: 01/26/2017
Due date: -
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description

Currently keep-web only supports range requests that begin at byte 0. It is desirable for clients such as genome viewers, which access BAM files via their .bai indexes, to be able to perform arbitrary byte-range addressing against keep-web-hosted content/URLs.

An example URL which could be used for testing:

http://www.broadinstitute.org/igv/projects/current/igv.php?sessionURL=https://collections.su92l.arvadosapi.com/c=0c4d584d89d5b672d281adb8140d6fa9-22756/HW67C49-EXT.chr11.bam&genome=hg19&locus=chr11:68562278-68562378

A version of the URL including a token in the path (/t=xxx) should work as well.


Subtasks

Task #10997: More unit tests (Resolved, Tom Clegg)

Task #10992: Review 10990-keep-web-ranges (Resolved, Peter Amstutz)


Related issues

Related to Arvados - Feature #10998: [keep-web] configurable block cache size (Resolved, 08/13/2019)

Associated revisions

Revision 827879be
Added by Tom Clegg almost 3 years ago

Merge branch '10990-keep-web-ranges'

refs #10990

Revision 519ab4c8
Added by Tom Clegg almost 3 years ago

Merge branch '10990-keep-web-ranges'

closes #10990

History

#1 Updated by Tom Morris almost 3 years ago

  • Story points set to 1.0

#2 Updated by Peter Amstutz almost 3 years ago

Implementation plan:

Make CollectionFileReader seekable.

Enable keep-web to handle range requests at nonzero offset by seeking before starting to read.
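The plan above can be sketched with the standard library alone: once the file reader implements io.ReadSeeker, net/http's ServeContent takes care of parsing the Range header, seeking to the requested offset, and replying with 206 Partial Content. The names below (serveFile, rangeRequest, the bytes.Reader stand-in for CollectionFileReader) are illustrative, not the actual keep-web code.

```go
package main

import (
	"bytes"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// serveFile serves any seekable reader; http.ServeContent parses the
// Range header, calls f.Seek to the requested offset, and replies
// with 206 Partial Content and a Content-Range header.
func serveFile(w http.ResponseWriter, r *http.Request, f io.ReadSeeker) {
	http.ServeContent(w, r, "file.bam", time.Time{}, f)
}

// rangeRequest performs a GET with the given Range header against
// serveFile and returns the response status code and body.
func rangeRequest(data []byte, rangeHdr string) (int, string) {
	req := httptest.NewRequest("GET", "/file.bam", nil)
	req.Header.Set("Range", rangeHdr)
	rec := httptest.NewRecorder()
	serveFile(rec, req, bytes.NewReader(data))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := rangeRequest([]byte("0123456789"), "bytes=4-6")
	println(code, body) // 206 456
}
```

With this shape, keep-web itself never parses Range headers; all the work is in making the collection file reader satisfy io.ReadSeeker.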

#3 Updated by Peter Amstutz almost 3 years ago

  • Assigned To changed from Peter Amstutz to Tom Clegg

#4 Updated by Tom Clegg almost 3 years ago

  • Status changed from New to In Progress

#5 Updated by Tom Morris almost 3 years ago

  • Description updated (diff)

#6 Updated by Tom Clegg almost 3 years ago

10990-keep-web-ranges @ 24b137a5b3313778e2db7f5d1e0c82daf0634a9c

Known issue: Seek() should get more unit tests (seeking forward, seeking across block boundaries, illegal seeks).

#7 Updated by Peter Amstutz almost 3 years ago

Tom Clegg wrote:

10990-keep-web-ranges @ 24b137a5b3313778e2db7f5d1e0c82daf0634a9c

Known issue: Seek() should get more unit tests (seeking forward, seeking across block boundaries, illegal seeks).

Comments:

CollectionFileReader:

  • Much easier to follow than the code it is replacing. I appreciate that the approach is more consistent with the approach taken by the Python SDK.
  • I suggest adding the file offset to manifest.FileSegment. Then you can determine the correct file segment for a given offset using a binary search instead of a linear search. For typical sequential reads, an even simpler optimization would be to check f.segNext before searching for the correct segment from the beginning.
  • I like the optimization of only enabling prefetching when reading past the first 1 MiB of the file. We should do something similar in the Python SDK.
  • What happens if you call Read() after Close()? I'm not sure whether len(nil) returns 0 or panics.
  • FYI, we found in Python that if a caller does read(N) and gets fewer than N bytes, that's sometimes interpreted as EOF (!), so we had to change the behavior to make a best effort to return the number of bytes requested. Go probably doesn't have this problem, but I wanted to mention it.
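The binary-search suggestion can be sketched with sort.Search: if each segment records its starting offset within the file, the segment covering a given seek position is found in O(log n). The segment struct and field names below are assumptions, not the real manifest.FileSegment:

```go
package main

import "sort"

// segment mirrors manifest.FileSegment with an added file offset
// (field names here are illustrative, not the real SDK struct).
type segment struct {
	Offset int64 // position of this segment within the file
	Len    int64 // segment length in bytes
}

// findSegment returns the index of the segment containing pos, or -1
// if pos is at or past the end of the file. Segments must be sorted
// by Offset, which holds when offsets are assigned sequentially.
func findSegment(segs []segment, pos int64) int {
	// First segment whose end is strictly greater than pos.
	i := sort.Search(len(segs), func(i int) bool {
		return segs[i].Offset+segs[i].Len > pos
	})
	if i == len(segs) {
		return -1
	}
	return i
}

func main() {
	segs := []segment{{0, 5}, {5, 3}, {8, 2}}
	println(findSegment(segs, 6)) // 1
}
```

For the sequential-read case mentioned above, checking the remembered next segment (f.segNext) before falling back to this search avoids even the O(log n) cost.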

BlockCache

  • Is there a reason to cap the cache based on block count instead of bytes? (The Python SDK uses bytes.) As it stands, four 1-byte blocks could count as a full cache.
  • We could stream results through instead of adding another store-and-forward stage by using streamer.AsyncStream in the BlockCache. This holds a buffer from which you can create multiple readers. Readers can start reading before the buffer is completely filled.
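Capping by bytes rather than block count could look like the sketch below: a least-recently-used list evicts blocks until the total size fits under the byte limit, so small blocks no longer fill the cache. This is illustrative only, not the actual keepclient BlockCache API:

```go
package main

import "container/list"

// byteCache is an LRU block cache capped by total bytes, not block count.
type byteCache struct {
	maxBytes int64
	used     int64
	lru      *list.List               // front = most recently used
	blocks   map[string]*list.Element // locator -> lru element
}

type entry struct {
	locator string
	data    []byte
}

func newByteCache(maxBytes int64) *byteCache {
	return &byteCache{maxBytes: maxBytes, lru: list.New(), blocks: map[string]*list.Element{}}
}

// Get returns a cached block and marks it most recently used.
func (c *byteCache) Get(locator string) ([]byte, bool) {
	if el, ok := c.blocks[locator]; ok {
		c.lru.MoveToFront(el)
		return el.Value.(*entry).data, true
	}
	return nil, false
}

// Put inserts a block, then evicts least-recently-used blocks until
// total size fits under maxBytes (always keeping the newest block).
func (c *byteCache) Put(locator string, data []byte) {
	c.blocks[locator] = c.lru.PushFront(&entry{locator, data})
	c.used += int64(len(data))
	for c.used > c.maxBytes && c.lru.Len() > 1 {
		el := c.lru.Back()
		e := el.Value.(*entry)
		c.lru.Remove(el)
		delete(c.blocks, e.locator)
		c.used -= int64(len(e.data))
	}
}

func main() {
	c := newByteCache(10)
	c.Put("a", make([]byte, 6))
	c.Put("b", make([]byte, 6)) // total 12 > 10, so "a" is evicted
	_, ok := c.Get("a")
	println(ok) // false
}
```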

#8 Updated by Peter Amstutz almost 3 years ago

Peter Amstutz wrote:

  • We could stream results through instead of adding another store-and-forward stage by using streamer.AsyncStream in the BlockCache. This holds a buffer from which you can create multiple readers. Readers can start reading before the buffer is completely filled.

I'm working on a branch that implements this.

#9 Updated by Peter Amstutz almost 3 years ago

keep-web:

  • keep-web should have an option to adjust the cache size
  • cache should be (API token)+(hash)

#10 Updated by Tom Clegg almost 3 years ago

Peter Amstutz wrote:

keep-web:

  • keep-web should have an option to adjust the cache size

Agreed

  • cache should be (API token)+(hash)

The only locators keep-web uses are ones that come from the API server in a collection's manifest_text. The API server doesn't give those out unless the caller is allowed to read the data. So if we have a hash to look up, then the caller is allowed to see the data, which means salting the cache key with the token would sabotage cache utilization without providing any benefit.

A client that fetches user-supplied hashes (and handles multiple users in one process) would need to keep the caches separate, though. With the current setup it's possible, but not the default behavior, to give each keepclient its own cache. Perhaps it would be safer to make "each keepclient gets its own cache" the default, and let keep-web override that with a single global cache?

#11 Updated by Peter Amstutz almost 3 years ago

Tom Clegg wrote:

Peter Amstutz wrote:

keep-web:

  • keep-web should have an option to adjust the cache size

Agreed

  • cache should be (API token)+(hash)

The only locators keep-web uses are ones that come from the API server in a collection's manifest_text. The API server doesn't give those out unless the caller is allowed to read the data. So if we have a hash to look up, then the caller is allowed to see the data, which means salting the cache key with the token would sabotage cache utilization without providing any benefit.

I can't think of a way that an attacker would easily trick keep-web into requesting an arbitrary block, so I think you're right. On the other hand, if we choose to also cache collection lookups in the future, I'm pretty sure we would want to tie it to the token.

Related, if we added caching to services that do work with blocks, such as keepproxy, we would definitely want to salt the cache keys.

A client that fetches user-supplied hashes (and handles multiple users in one process) would need to keep the caches separate, though. With the current setup it's possible, but not the default behavior, to give each keepclient its own cache. Perhaps it would be safer to make "each keepclient gets its own cache" the default, and let keep-web override that with a single global cache?

I think a global cache is better if the cache keys incorporate the tokens, but separate caches are better if they don't.

#12 Updated by Peter Amstutz almost 3 years ago

Peter Amstutz wrote:

Peter Amstutz wrote:

  • We could stream results through instead of adding another store-and-forward stage by using streamer.AsyncStream in the BlockCache. This holds a buffer from which you can create multiple readers. Readers can start reading before the buffer is completely filled.

I'm working on a branch that implements this.

See 10990-keep-web-ranges-pa

Passes range request tests.

Would be interesting to set this up on a test server and do some benchmarking of streaming vs store-and-forward.

#13 Updated by Tom Clegg almost 3 years ago

Peter Amstutz wrote:

I can't think of a way that an attacker would easily trick keep-web into requesting an arbitrary block, so I think you're right. On the other hand, if we choose to also cache collection lookups in the future, I'm pretty sure we would want to tie it to the token.

Yes, collection IDs are provided by the client so we'd certainly need to avoid leaking that cache between tokens (e.g., a client who can't read a collection shouldn't even be able to see which files exist, let alone the contents). So the block cache would still be fine.

Related, if we added caching to services that do work with blocks, such as keepproxy, we would definitely want to salt the cache keys.

I think a global cache is better if the cache keys incorporate the tokens, but separate caches are better if they don't.

Surely we can solve this without keeping multiple copies of the same data in memory. If we have cached data but aren't sure of the token, we could do a HEAD request to check permission, then return the cached data...
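That idea might look like the sketch below: one shared cache holds each block once, and a cheap permission probe (standing in for the HEAD request) gates access whenever the requesting token differs from the one that filled the cache. All names here are hypothetical:

```go
package main

// permChecker stands in for a HEAD request to a keep service that
// answers whether a token may read a given block locator.
type permChecker func(token, locator string) bool

// sharedCache holds one copy of each block regardless of which token
// fetched it, remembering the filling token per locator.
type sharedCache struct {
	blocks map[string][]byte // locator -> data (one copy, all tokens)
	filler map[string]string // locator -> token that fetched it
}

// get returns cached data if the token is allowed to read it. When
// the cache was filled under a different token, canRead (the HEAD
// probe) is consulted before the data is released.
func (c *sharedCache) get(token, locator string, canRead permChecker) ([]byte, bool) {
	data, ok := c.blocks[locator]
	if !ok {
		return nil, false
	}
	if c.filler[locator] != token && !canRead(token, locator) {
		return nil, false
	}
	return data, true
}

func main() {
	c := &sharedCache{
		blocks: map[string][]byte{"loc1": []byte("data")},
		filler: map[string]string{"loc1": "tokenA"},
	}
	headOK := func(token, locator string) bool { return token == "tokenB" }
	_, ok := c.get("tokenC", "loc1", headOK)
	println(ok) // false: tokenC fails the permission probe
}
```

The trade-off is one extra round trip per (token, block) pair, instead of one extra copy of the block per token.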

#14 Updated by Tom Morris almost 3 years ago

Successfully tested the deployed version of this.

#15 Updated by Tom Clegg almost 3 years ago

  • Story points deleted (1.0)

#16 Updated by Tom Clegg almost 3 years ago

  • Target version changed from 2017-02-01 sprint to 2017-02-15 sprint

#17 Updated by Tom Clegg almost 3 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 33 to 100

Applied in changeset arvados|commit:519ab4c83aeb44ce91da941ecc191d00b6c6c72b.
