Feature #14259 (Closed)
[SDK] Python collection class uses copy remote block to local keepstore
Description
The Python SDK Collection class uses the "refresh keep signature" API (#14199) to convert +R token signatures into +A signatures.
This only supports copying blocks from remote clusters to the home cluster, and only supports creating collections on the local cluster.
The Python keep client needs a new method to access the "refresh keep signature" API.
The Collection class will check whether ArvadosFile objects have any locators with remote signatures (+R) and use the refresh keep signature API before creating or updating a collection on the home cluster (the Collection.save() operation).
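For reference, the conversion rewrites the signature hint on each block locator. The +R example below is the real locator exercised later in this thread; the +A form is illustrative:

    # Remote-signed locator: +R<remote_cluster>-<signature>@<expiry>
    6386dee2604bf10d89895a9896e07c04+1962+R1cnct-568e0ea34658171942c3c67cfb76b908dbe56335@5be5c162
    # After refresh, locally signed: +A<signature>@<expiry> (signature made up for illustration)
    6386dee2604bf10d89895a9896e07c04+1962+A27f67d2c6b559b51477831f0f1f4b8c999b9e39c@5be5c162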
Related issues
Updated by Peter Amstutz about 6 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz about 6 years ago
- Related to Feature #14199: [keepstore] copy block from remote keepstore to local keepstore added
Updated by Peter Amstutz about 6 years ago
- Status changed from In Progress to New
Updated by Ward Vandewege about 6 years ago
- Target version changed from To Be Groomed to 2018-10-17 sprint
Updated by Lucas Di Pentima about 6 years ago
- Assigned To set to Lucas Di Pentima
Updated by Lucas Di Pentima about 6 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima about 6 years ago
- Target version changed from 2018-10-17 sprint to 2018-10-31 sprint
Updated by Peter Amstutz about 6 years ago
It just occurred to me that this ticket needs a Go SDK counterpart, because crunch-run in some circumstances will copy block locators directly from input to output, so it also needs to support the translation from +R to +A.
Updated by Peter Amstutz about 6 years ago
- Related to Feature #14406: [SDK] Go collection uses copy remote block to local keepstore added
Updated by Lucas Di Pentima about 6 years ago
I'm doing some manual testing on the keep client side of this story. I first tried using 9tee4 & c97qk to get the block signature "translation", but it didn't work (after fixing 9tee4's config and being able to get a remote collection from the pysdk), so I tried running 2 federated arvboxes, with the same result.
With the home cluster token, I run curl to get a remote cluster block's signature translated and get this error:
    $ curl -I http://172.17.0.2:25108/6386dee2604bf10d89895a9896e07c04+1962+R1cnct-568e0ea34658171942c3c67cfb76b908dbe56335@5be5c162 \
        -H 'Authorization: Bearer v2/34se5-gj3su-ux1dq4873gz33fc/58f5e80t81btxoo4ga76y1xqo7pq96b2acnjqomjg7oth16isk' \
        -H 'X-Keep-Signature: local, 2018-10-26T12:42:52.362-03:00'
    HTTP/1.1 502 Bad Gateway
    Content-Type: text/plain; charset=utf-8
    Vary: X-Keep-Signature
    X-Content-Type-Options: nosniff
    Date: Fri, 26 Oct 2018 17:37:19 GMT
    Content-Length: 30
In the local cluster's keepstore log:
2018-10-26_17:37:19.16470 {"RequestID":"req-1lsuugpysf8031ovr8g4","level":"info","msg":"response","remoteAddr":"172.17.0.1:57754","reqBytes":0,"reqForwardedFor":"","reqHost":"172.17.0.2:25108","reqMethod":"HEAD","reqPath":"6386dee2604bf10d89895a9896e07c04+1962+R1cnct-568e0ea34658171942c3c67cfb76b908dbe56335@5be5c162","reqQuery":"","respBytes":30,"respStatus":"Bad Gateway","respStatusCode":502,"time":"2018-10-26T17:37:19.164496983Z","timeToStatus":0.028260,"timeTotal":0.028415,"timeWriteBody":0.000155}
Am I missing something?
Updated by Lucas Di Pentima about 6 years ago
Both config.yml files seem to be generated correctly:
    $ cat ~/.arvbox/cluster1/var/cluster_config.yml
    Clusters:
      34se5:
        NodeProfiles:
          '*':
            arvados-api-server: {Listen: ':8004'}
            arvados-controller: {Listen: ':8003'}
        PostgreSQL:
          Connection: {DBName: arvados_development, Host: localhost, Password: 55m929avm21jg5kg7ezhlueex, User: arvados, client_encoding: utf8}
          ConnectionPool: 32
        RemoteClusters:
          1cnct: {Host: '172.17.0.3:8000', Insecure: true, Proxy: true}
          34se5: {Host: '172.17.0.2:8000', Insecure: true, Proxy: true}

    $ cat ~/.arvbox/cluster2/var/cluster_config.yml
    Clusters:
      1cnct:
        NodeProfiles:
          '*':
            arvados-api-server: {Listen: ':8004'}
            arvados-controller: {Listen: ':8003'}
        PostgreSQL:
          Connection: {DBName: arvados_development, Host: localhost, Password: eh91xfa6og4ciyoayyso6rfbk, User: arvados, client_encoding: utf8}
          ConnectionPool: 32
        RemoteClusters:
          1cnct: {Host: '172.17.0.3:8000', Insecure: true, Proxy: true}
          34se5: {Host: '172.17.0.2:8000', Insecure: true, Proxy: true}
Updated by Lucas Di Pentima about 6 years ago
I found the issue: the keepstores weren't restarted after editing the override config files.
Updated by Lucas Di Pentima about 6 years ago
Updates at 4c669f3d7 - branch 14259-pysdk-remote-block-copy
Test run: https://ci.curoverse.com/job/developer-run-tests/953/
KeepClient.refresh_signature() makes a HEAD request to copy remote blocks to local keep storage. The Collection class uses refresh_signature() when saving a local collection (or creating a new one) with remote blocks.
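A minimal usage sketch of the new method, assuming an Arvados API token is configured in the environment (the remote_locator value is made up):

    import arvados
    import arvados.keep

    # Keep client talking to the home cluster.
    kc = arvados.keep.KeepClient(api_client=arvados.api('v1'))

    # A block locator signed by a remote cluster (+R).
    remote_locator = 'acbd18db4cc2f85cedef654fccc4a4d8+3+Rzzzzz-1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d@5be5c162'

    # refresh_signature() issues a HEAD request with an X-Keep-Signature
    # header; the local keepstore copies the block from the remote
    # cluster and returns a locally-signed (+A) locator.
    local_locator = kc.refresh_signature(remote_locator)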
Updated by Peter Amstutz about 6 years ago
- Assigned To changed from Lucas Di Pentima to Joshua Randall
Lucas Di Pentima wrote:
Updates at 4c669f3d7 - branch 14259-pysdk-remote-block-copy
Test run: https://ci.curoverse.com/job/developer-run-tests/953/
KeepClient.refresh_signature() makes a HEAD request to copy remote blocks to local keep storage. The Collection class uses refresh_signature() when saving a local collection (or creating a new one) with remote blocks.
    if loop.success():
        if method == "HEAD":
            return blob or True
        else:
            return blob
- I think when loop.success() is true, blob cannot be None, so I don't know if the special case for HEAD or the 'or' clause are necessary?
- I noticed a caching bug. It looks like it saves the result of HEAD operations in the cache. That means if you called head() and then called get() with the same locator, you would get back True or the signed locator, but not the expected blob contents. Please write a test to check this case (a sketch follows this list). If you can confirm the bug, a possible fix is to not cache the results of HEAD requests.
- There is a request timeout and a "low speed timeout" that will kill the request if the transfer takes too long or the transfer rate drops too low. However, HEAD requests don't transfer anything, and block until the data is transferred by the keep server. Please make sure it won't fail if the keep-to-keep transfer is slow.
- _copy_remote_blocks() only needs to go through the collection contents once, and can use self.items().
- I'm a little concerned about the potential overhead of calling _copy_remote_blocks on every collection save, when in the common case we don't have any remote signatures. I wonder if you could do some benchmarking to get a sense of how expensive this extra check is when saving very large collections.
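For the caching-bug point, a minimal sketch of such a test, assuming a reachable Keep service and the KeepClient put()/head()/get() methods (the SDK's real tests would use mocked services instead):

    import unittest
    import arvados
    import arvados.keep

    class HeadThenGetTest(unittest.TestCase):
        def test_head_does_not_poison_get_cache(self):
            kc = arvados.keep.KeepClient(api_client=arvados.api('v1'))
            data = b'foo'
            locator = kc.put(data)
            # head() may legitimately return True or a signed locator...
            self.assertTrue(kc.head(locator))
            # ...but a subsequent get() must return the block contents,
            # not whatever head() left behind in the cache.
            self.assertEqual(data, kc.get(locator))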
Updated by Lucas Di Pentima about 6 years ago
Merged to master to unblock federation, while continuing to polish the following topics:
Peter Amstutz wrote:
- I think when loop.success() is true, blob cannot be None, so I don't know if the special case for HEAD or the 'or' clause are necessary?

Yes, I think you're right. On the other kinds of HEAD requests the returned value was True, so there's no point in returning True or True.

- I noticed a caching bug. It looks like it saves the result of HEAD operations in the cache. That means if you called head() and then called get() with the same locator, you would get back True or the signed locator, but not the expected blob contents. Please write a test to check this case. If you can confirm the bug, a possible fix is to not cache the results of HEAD requests.

We shouldn't get the blob contents on any HEAD request, right? I think we just use the 'blob' var name because it's a special case of a GET request. OTOH, I think caching HEAD requests would be beneficial for remote locator translations: right now I avoid requesting a copy operation of the same block more than once by using that remote_blocks dict, but if that gets handled at the keep client level, it may simplify the Collection class' code. Do you agree?

- There is a request timeout and a "low speed timeout" that will kill the request if the transfer takes too long or the transfer rate drops too low. However, HEAD requests don't transfer anything, and block until the data is transferred by the keep server. Please make sure it won't fail if the keep-to-keep transfer is slow.

Ok, I suppose timeouts should still be enforced, but not nonexistent transfer rates. I'll check that.

- _copy_remote_blocks() only needs to go through the collection contents once, and can use self.items().

Right, I'll fix it so we do only one pass, thanks.

- I'm a little concerned about the potential overhead of calling _copy_remote_blocks() on every collection save, when in the common case we don't have any remote signatures. I wonder if you could do some benchmarking to get a sense of how expensive this extra check is when saving very large collections.

How about checking on local collections whether the manifest_text contains '+R[a-z]{5}-' and only then calling _copy_remote_blocks()? (A sketch follows below.)
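A sketch of that proposed check, using a cached manifest attribute as discussed in the follow-up below (the helper name _should_copy_remote_blocks is hypothetical):

    import re

    _REMOTE_SIGNATURE_RE = re.compile(r'\+R[a-z]{5}-')

    def _should_copy_remote_blocks(collection):
        # Only scan the collection for remote (+R) blocks when the
        # cached manifest text suggests some might be present.
        manifest = collection._manifest_text
        if manifest is None:
            # No cached manifest: we can't rule remote blocks out,
            # so scan to be safe.
            return True
        return _REMOTE_SIGNATURE_RE.search(manifest) is not None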
Updated by Peter Amstutz about 6 years ago
Lucas Di Pentima wrote:
- I noticed a caching bug. It looks like it saves the result of HEAD operations in the cache. That means if you called head() and then called get() with the same locator, you would get back True or the signed locator, but not the expected blob contents. Please write a test to check this case. If you can confirm the bug, a possible fix is to not cache the results of HEAD requests.

We shouldn't get the blob contents on any HEAD request, right? I think we just use the 'blob' var name because it's a special case of a GET request. OTOH, I think caching HEAD requests would be beneficial for remote locator translations: right now I avoid requesting a copy operation of the same block more than once by using that remote_blocks dict, but if that gets handled at the keep client level, it may simplify the Collection class' code. Do you agree?

The variable is called blob but it holds whatever was returned by the get() method. If you want to cache HEAD responses, you should add a "signed_locator" field to CacheSlot and have slot.set() take both 'contents' and the response in X-Keep-Locator. The potential bug I want to avoid is: when calling head() followed by get(), you don't want the get() method returning the locator when it was supposed to return the block contents.
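A sketch of that CacheSlot change (based on the cache slot shape in the Python keep client; the signed_locator field and the extra set() parameter are the suggested additions, not existing API):

    import threading

    class CacheSlot(object):
        def __init__(self, locator):
            self.locator = locator
            self.ready = threading.Event()
            self.content = None
            # Suggested addition: cache the refreshed locator returned
            # in the X-Keep-Locator header of a HEAD response.
            self.signed_locator = None

        def set(self, value, signed_locator=None):
            # Keep block contents and signed locator in separate fields
            # so a get() after a head() never receives a locator where
            # it expected block data.
            self.content = value
            self.signed_locator = signed_locator
            self.ready.set()

        def get(self):
            self.ready.wait()
            return self.content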
- I'm a little concerned about the potential overhead of calling _copy_remote_blocks() on every collection save, when in the common case we don't have any remote signatures. I wonder if you could do some benchmarking to get a sense of how expensive this extra check is when saving very large collections.

How about checking on local collections whether the manifest_text contains '+R[a-z]{5}-' and only then calling _copy_remote_blocks()?

I think that might be a good idea, but a little bit of benchmarking would let us know for sure. Make sure to use the attribute self._manifest_text, not the manifest_text() method. If the collection wasn't initialized from a manifest, the attribute will probably be None, so then you have to call _copy_remote_blocks() to be safe.
Thanks!
Updated by Lucas Di Pentima about 6 years ago
- Assigned To changed from Joshua Randall to Lucas Di Pentima
- Target version changed from 2018-10-31 sprint to 2018-11-14 Sprint
Updated by Lucas Di Pentima about 6 years ago
Updates at b8addea7d6c8c464a2743191561470332eeaba13
Test run: https://ci.curoverse.com/job/developer-run-tests/960/ (pending)
Addressed suggestions. My idea of matching a remote block locator against col._manifest_text won't work because sometimes that attribute isn't updated even when the collection has remote blocks. I'm in the process of gathering performance data; it's taking me a little time to build the testing environment. In the meantime the updates can be reviewed.
Updated by Lucas Di Pentima about 6 years ago
After fighting too long with arvbox to set up an interestingly large collection, I changed strategies and wrote a script that builds a manifest with lots of empty files, creates the collection with save_new(), and measures the save() call time when the collection is already committed, giving me these results:

- Without _copy_remote_blocks(): ~2.2e-5 secs (regardless of the number of files)
- With _copy_remote_blocks(): ~1.2e-5 secs per file in the collection (totally unacceptable)

(tested with 10K, 100K & 1M files)

So this is a clear indication that, at least on those committed collections, the remote block scanning should be avoided.
I'm working on optimizations.
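A rough sketch of the kind of benchmark script described above (assuming the SDK's Collection API; the empty-block manifest and timing harness are illustrative):

    import time
    import arvados.collection

    NUM_FILES = 100000
    # Manifest with many zero-length files, all referencing the
    # well-known empty block (MD5 of zero bytes).
    manifest = '. d41d8cd98f00b204e9800998ecf8427e+0' + ''.join(
        ' 0:0:file%d.txt' % i for i in range(NUM_FILES)) + '\n'

    c = arvados.collection.Collection(manifest)
    c.save_new(name='benchmark collection')

    start = time.time()
    c.save()  # already committed: measures just the per-save overhead
    print('save() took %.6f seconds' % (time.time() - start))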
Updated by Lucas Di Pentima about 6 years ago
Optimized the use of _copy_remote_blocks() at c3b2675
Test run: https://ci.curoverse.com/job/developer-run-tests/967/
Updated by Peter Amstutz about 6 years ago
Lucas Di Pentima wrote:
Optimized the use of _copy_remote_blocks() at c3b2675
Test run: https://ci.curoverse.com/job/developer-run-tests/967/
Keeping a flag seems like the right strategy.
- has_remote_blocks() should check the _has_remote_blocks flag
- Both Collection.copy() and Collection.rename() call Collection.add(), so the has_remote_blocks() check should go there.
- You should add the file part of _copy_remote_blocks() to ArvadosFile, so you don't need to do isinstance(). In addition, you could access the file segments list directly (avoiding the overhead of the copy made by calling segments()), which would also be a bit cleaner for updating the segment locators in place. A sketch follows this list.
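A sketch of that last point on ArvadosFile (the segment attribute and keep-client accessor are assumptions about SDK internals, not the committed code):

    # Hypothetical method on ArvadosFile.
    def _copy_remote_blocks(self, remote_blocks=None):
        """Rewrite remote (+R) segment locators to local (+A) ones.

        remote_blocks memoizes translations so each block is refreshed
        at most once per save.
        """
        if remote_blocks is None:
            remote_blocks = {}
        for segment in self._segments:
            if '+R' in segment.locator:
                if segment.locator not in remote_blocks:
                    remote_blocks[segment.locator] = \
                        self.parent._my_keep().refresh_signature(segment.locator)
                segment.locator = remote_blocks[segment.locator]
        return remote_blocks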
Updated by Lucas Di Pentima about 6 years ago
Updates at da53a8d80
Test run: https://ci.curoverse.com/job/developer-run-tests/973/
Addressed the above suggestions.
As for the second bullet point, I did it that way at first because the add() call is done on the target_dir, which may or may not be the current collection. I've now added a recursive method to set _has_remote_blocks all the way up to the root collection.
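The recursive setter could look roughly like this (a sketch; method and attribute names follow the discussion above):

    def set_has_remote_blocks(self, val):
        # Mark this collection as containing remote (+R) blocks and
        # propagate the flag up to the root collection, so save() on
        # any ancestor knows to run _copy_remote_blocks().
        self._has_remote_blocks = val
        if self.parent is not None:
            self.parent.set_has_remote_blocks(val)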
Updated by Peter Amstutz about 6 years ago
Lucas Di Pentima wrote:
Updates at da53a8d80
Test run: https://ci.curoverse.com/job/developer-run-tests/973/
Addressed the above suggestions.
As for the second bullet point, I did it that way at first because the add() call is done on the target_dir, which may or may not be the current collection. I've now added a recursive method to set _has_remote_blocks all the way up to the root collection.
This LGTM.
Updated by Lucas Di Pentima about 6 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|a698257bb62e04201acd1d3f9b6edd094296d5c0.