Feature #14259

[SDK] Python collection class uses copy remote block to local keepstore

Added by Peter Amstutz about 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
10/30/2018
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
2.0
Release:
Release relationship:
Auto

Description

Python SDK Collection class uses "refresh keep signature" API (#14199) to convert +R token signatures into +A signatures.

Only supports copying block from remote clusters to home cluster. Only supports creating collections on the local cluster.

The Python keep client needs a new method to access the "refresh keep signature" API.

The Collection class will check if ArvadosFile objects have any locators with remote signatures (+R) and uses the refresh keep signature API before creating / updating a collection on the home cluster (Collection.save() operation).


Subtasks

Task #14293: Review 14259-pysdk-remote-block-copyResolvedPeter Amstutz


Related issues

Related to Arvados - Feature #14199: [keepstore] copy block from remote keepstore to local keepstoreResolved10/04/2018

Related to Arvados - Feature #14406: [SDK] Go collection uses copy remote block to local keepstoreResolved11/05/2018

Associated revisions

Revision e59b78c7
Added by Lucas Di Pentima about 1 year ago

Merge branch '14259-pysdk-remote-block-copy'
Refs #14259

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

Revision a698257b
Added by Lucas Di Pentima about 1 year ago

Merge branch '14259-pysdk-remote-block-copy'
Closes #14259

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

History

#1 Updated by Peter Amstutz about 1 year ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz about 1 year ago

  • Related to Feature #14199: [keepstore] copy block from remote keepstore to local keepstore added

#3 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)

#4 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)

#5 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)

#6 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)

#7 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)

#8 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)

#9 Updated by Peter Amstutz about 1 year ago

  • Story points set to 2.0

#10 Updated by Peter Amstutz about 1 year ago

  • Status changed from In Progress to New

#11 Updated by Ward Vandewege about 1 year ago

  • Target version changed from To Be Groomed to 2018-10-17 sprint

#12 Updated by Lucas Di Pentima about 1 year ago

  • Assigned To set to Lucas Di Pentima

#13 Updated by Lucas Di Pentima about 1 year ago

  • Status changed from New to In Progress

#14 Updated by Lucas Di Pentima about 1 year ago

  • Target version changed from 2018-10-17 sprint to 2018-10-31 sprint

#15 Updated by Peter Amstutz about 1 year ago

It just occurred to me that this ticket needs a Go SDK counterpart, because crunch-run in some circumstances will copy block locators directly from input to output, so it also needs to support the translation from +R to +A.

#16 Updated by Peter Amstutz about 1 year ago

  • Related to Feature #14406: [SDK] Go collection uses copy remote block to local keepstore added

#17 Updated by Lucas Di Pentima about 1 year ago

I'm making some manual testing on the keep client side of this story. First tried using 9tee4 & c97qk to get the block signature "translation" and didn't work (after fixing 9tee4's config and being able to get a remote collection from the pysdk), so I tried running 2 federated arvboxes, with the same result.

With the home cluster token I run curl to get a remote cluster block's signature translated and get this error:

$ curl -I http://172.17.0.2:25108/6386dee2604bf10d89895a9896e07c04+1962+R1cnct-568e0ea34658171942c3c67cfb76b908dbe56335@5be5c162 -H 'Authorization: Bearer v2/34se5-gj3su-ux1dq4873gz33fc/58f5e80t81btxoo4ga76y1xqo7pq96b2acnjqomjg7oth16isk' -H 'X-Keep-Signature: local, 2018-10-26T12:42:52.362-03:00'
HTTP/1.1 502 Bad Gateway
Content-Type: text/plain; charset=utf-8
Vary: X-Keep-Signature
X-Content-Type-Options: nosniff
Date: Fri, 26 Oct 2018 17:37:19 GMT
Content-Length: 30

On the local cluster keep's log:

2018-10-26_17:37:19.16470 {"RequestID":"req-1lsuugpysf8031ovr8g4","level":"info","msg":"response","remoteAddr":"172.17.0.1:57754","reqBytes":0,"reqForwardedFor":"","reqHost":"172.17.0.2:25108","reqMethod":"HEAD","reqPath":"6386dee2604bf10d89895a9896e07c04+1962+R1cnct-568e0ea34658171942c3c67cfb76b908dbe56335@5be5c162","reqQuery":"","respBytes":30,"respStatus":"Bad Gateway","respStatusCode":502,"time":"2018-10-26T17:37:19.164496983Z","timeToStatus":0.028260,"timeTotal":0.028415,"timeWriteBody":0.000155}

Am I missing something?

#18 Updated by Lucas Di Pentima about 1 year ago

Both config.yml files seem to be generated correctly:

$ cat ~/.arvbox/cluster1/var/cluster_config.yml
Clusters:
  34se5:
    NodeProfiles:
      '*':
        arvados-api-server: {Listen: ':8004'}
        arvados-controller: {Listen: ':8003'}
    PostgreSQL:
      Connection: {DBName: arvados_development, Host: localhost, Password: 55m929avm21jg5kg7ezhlueex,
        User: arvados, client_encoding: utf8}
      ConnectionPool: 32
    RemoteClusters:
      1cnct: {Host: '172.17.0.3:8000', Insecure: true, Proxy: true}
      34se5: {Host: '172.17.0.2:8000', Insecure: true, Proxy: true}

$ cat ~/.arvbox/cluster2/var/cluster_config.yml
Clusters:
  1cnct:
    NodeProfiles:
      '*':
        arvados-api-server: {Listen: ':8004'}
        arvados-controller: {Listen: ':8003'}
    PostgreSQL:
      Connection: {DBName: arvados_development, Host: localhost, Password: eh91xfa6og4ciyoayyso6rfbk,
        User: arvados, client_encoding: utf8}
      ConnectionPool: 32
    RemoteClusters:
      1cnct: {Host: '172.17.0.3:8000', Insecure: true, Proxy: true}
      34se5: {Host: '172.17.0.2:8000', Insecure: true, Proxy: true}

#19 Updated by Lucas Di Pentima about 1 year ago

I found the issue, keepstores weren't restarted after editing the override config files.

#20 Updated by Lucas Di Pentima about 1 year ago

Updates at 4c669f3d7 - branch 14259-pysdk-remote-block-copy
Test run: https://ci.curoverse.com/job/developer-run-tests/953/

  • KeepClient.refresh_signature() makes a HEAD request to copy remote blocks to local keep storage.
  • Collection class uses refresh_signature() when saving a local collection (or creating a new one) with remote blocks.

#21 Updated by Peter Amstutz about 1 year ago

  • Assigned To changed from Lucas Di Pentima to Joshua Randall

Lucas Di Pentima wrote:

Updates at 4c669f3d7 - branch 14259-pysdk-remote-block-copy
Test run: https://ci.curoverse.com/job/developer-run-tests/953/

  • KeepClient.refresh_signature() makes a HEAD request to copy remote blocks to local keep storage.
  • Collection class uses refresh_signature() when saving a local collection (or creating a new one) with remote blocks.
            if loop.success():
                if method == "HEAD":
                    return blob or True
                else:
                    return blob
  • I think when loop.success() is true blob cannot be None, so I don't know if the special case for HEAD or the 'or' clause are necessary?
  • I noticed a caching bug. It looks like it saves the result of HEAD operations in the cache. That means if you called head() and then called get() with the same locator, you would get back True or the signed locator, but not the expected blob contents. Please write a test to check this case. If you can confirm the bug, a possible fix is to not cache the results of HEAD requests.
  • There is a request timeout and a "low speed timeout" that will kill the request if the transfer time or rate exceeds a certain threshold. However, HEAD requests don't transfer anything, and block until the data is transferred by the keep server. Please make sure it won't fail if the keep-to-keep transfer is slow.
  • _copy_remote_blocks() only needs to go through the collection contents once, and can use self.items().
  • I'm a little concerned about the potential overhead of calling _copy_remote_blocks on every collection save, when in the common case we don't have any remote signatures. I wonder if you could do some benchmarking to get a sense of how expensive this extra check is when saving very large collections.

#22 Updated by Lucas Di Pentima about 1 year ago

Merged to master to unblock federation, while continuing to polish the following topics:

Peter Amstutz wrote:

  • I think when loop.success() is true blob cannot be None, so I don't know if the special case for HEAD or the 'or' clause are necessary?

Yes, I think you're right. On the other kind of HEAD requests the returned value was True so there's no point to return True or True.

  • I noticed a caching bug. It looks like it saves the result of HEAD operations in the cache. That means if you called head() and then called get() with the same locator, you would get back True or the signed locator, but not the expected blob contents. Please write a test to check this case. If you can confirm the bug, a possible fix is to not cache the results of HEAD requests.

We shouldn't get the blob contents on any HEAD request, right? I think we just use 'blob' var name because it's a special case of a GET request. OTOH, I think caching HEAD requests would be beneficial for remote locator translations, right now I avoid requesting a copy operation of the same block more than once by using that remote_blocks dict, but if that gets controlled at keep client level, it may be useful to simplify the Collection class' code, do you agree?

  • There is a request timeout and a "low speed timeout" that will kill the request if the transfer time or rate exceeds a certain threshold. However, HEAD requests don't transfer anything, and block until the data is transferred by the keep server. Please make sure it won't fail if the keep-to-keep transfer is slow.

Ok, I suppose timeouts should still be enforced, but not inexistent transfer rates. I'll check that.

  • _copy_remote_blocks() only needs to go through the collection contents once, and can use self.items().

Right, I'll fix it so we do only one pass, thanks.

  • I'm a little concerned about the potential overhead of calling _copy_remote_blocks on every collection save, when in the common case we don't have any remote signatures. I wonder if you could do some benchmarking to get a sense of how expensive this extra check is when saving very large collections.

How about checking on local collections if the manifest_text contains '+R[a-z]{5}-' and only then call _copy_remote_blocks()?

#23 Updated by Peter Amstutz about 1 year ago

Lucas Di Pentima wrote:

  • I noticed a caching bug. It looks like it saves the result of HEAD operations in the cache. That means if you called head() and then called get() with the same locator, you would get back True or the signed locator, but not the expected blob contents. Please write a test to check this case. If you can confirm the bug, a possible fix is to not cache the results of HEAD requests.

We shouldn't get the blob contents on any HEAD request, right? I think we just use 'blob' var name because it's a special case of a GET request. OTOH, I think caching HEAD requests would be beneficial for remote locator translations, right now I avoid requesting a copy operation of the same block more than once by using that remote_blocks dict, but if that gets controlled at keep client level, it may be useful to simplify the Collection class' code, do you agree?

The variable is called blob but it holds whatever was returned by the get() method. If you want to cache HEAD responses, you should add a "signed_locator" field to CacheSlot and have slot.set() take both 'contents' and the response in X-Keep-Locator. The potential bug I want to avoid is when calling head() followed by get() you don't want the get() method returning the locator when it was supposed to be returning the block contents.

  • I'm a little concerned about the potential overhead of calling _copy_remote_blocks on every collection save, when in the common case we don't have any remote signatures. I wonder if you could do some benchmarking to get a sense of how expensive this extra check is when saving very large collections.

How about checking on local collections if the manifest_text contains '+R[a-z]{5}-' and only then call _copy_remote_blocks()?

I think that might be a good idea, but a little bit of benchmarking would let us know for sure. Make sure to use the attribute self._manifest_text, not the manifest_text() method. If the collection wasn't initialized from a manifest the attribute will probably be None, so then you have to call _copy_remote_blocks() to be safe.

Thanks!

#24 Updated by Lucas Di Pentima about 1 year ago

  • Assigned To changed from Joshua Randall to Lucas Di Pentima
  • Target version changed from 2018-10-31 sprint to 2018-11-14 Sprint

#25 Updated by Lucas Di Pentima about 1 year ago

Updates at b8addea7d6c8c464a2743191561470332eeaba13
Test run: https://ci.curoverse.com/job/developer-run-tests/960/ (pending)

Addressed suggestions. My idea of matching a remote block locator against col._manifest_text won't work because sometimes that attribute isn't updated while having remote blocks. I'm in the process of gathering performance data, it's taking me a little time to build the testing environment. In the meantime the updates can be reviewed.

#26 Updated by Lucas Di Pentima about 1 year ago

After fighting too long with arvbox to set a interestingly large collection, I changed strategies making a script that builds a manifest with lots of empty files, creates the collection with save_new() and measure the save() call time when already committed, giving me these results:

  • Without _copy_remote_blocks(): ~2.2e-5 secs (without regards of number of files)
  • With _copy_remote_blocks(): ~1,2e-5 secs per file on the collection (totally unacceptable).

(tested with 10K, 100K & 1M files)

So this is a clear indication that at least on those committed collections the remote block scanning should be avoided.
I'm working on optimizations.

#27 Updated by Lucas Di Pentima about 1 year ago

Optimized the use of _copy_remote_blocks() at c3b2675
Test run: https://ci.curoverse.com/job/developer-run-tests/967/

#28 Updated by Peter Amstutz about 1 year ago

Lucas Di Pentima wrote:

Optimized the use of _copy_remote_blocks() at c3b2675
Test run: https://ci.curoverse.com/job/developer-run-tests/967/

Keeping a flag seems like the right strategy.

  • has_remote_blocks() should check the _has_remote_blocks flag
  • Both Collection.copy() and Collection.rename() call Collection.add(), so the has_remote_blocks() check should go there.
  • you should add the file part of _copy_remote_blocks() to ArvadosFile, then you don't need to do isinstance(). In addition you could access the file segments list directly (so you don't incur the overhead of a copy from calling segments()) and would be a bit cleaner for updating the segment locator directly.

#29 Updated by Lucas Di Pentima about 1 year ago

Updates at da53a8d80
Test run: https://ci.curoverse.com/job/developer-run-tests/973/

Addressed above suggestions.

As for the second bulletpoint, I did it that way at first because the add() call is done on the target_dir, that may or may not be the current collection. I've now added a recursive method to set _has_remote_blocks all the way up to the root collection.

#30 Updated by Peter Amstutz about 1 year ago

Lucas Di Pentima wrote:

Updates at da53a8d80
Test run: https://ci.curoverse.com/job/developer-run-tests/973/

Addressed above suggestions.

As for the second bulletpoint, I did it that way at first because the add() call is done on the target_dir, that may or may not be the current collection. I've now added a recursive method to set _has_remote_blocks all the way up to the root collection.

This LGTM.

#31 Updated by Lucas Di Pentima about 1 year ago

  • Status changed from In Progress to Resolved

#32 Updated by Tom Morris about 1 year ago

  • Release set to 14

Also available in: Atom PDF