Idea #8937

[SDKs] Write integration test for when arv-put resumes from a cache with expired access tokens

Added by Brett Smith almost 8 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: SDKs
Target version:
Story points: 0.5

Description

arv-put does not check the validity of the permission signatures on its blocks when it resumes from its cache. When it resumes from a cache with invalid permission signatures, it will continue uploading blocks, but the collection creation at the end of the process will fail.

Update the Python SDK:

  • Add a head() method to the KeepClient class. It takes a block locator and num_retries as arguments. It sends a HEAD request for that block to all accessible Keep services in the usual rendezvous order. It returns True if one service returns 200 OK, and raises an exception if it never gets that response. It retries following the same logic as the get() method. (A usage sketch follows this list.)
  • Add tests for this method that check behavior when:
    • One accessible service returns 200 OK.
    • The last accessible service returns 200 OK.
    • All accessible services return a permanent error.
    • All accessible services return an error. Some of those errors are temporary. On a second request, at least one of the services returns 200 OK.
    • All accessible services return temporary errors, enough to exhaust the number of retries.
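
For illustration, here's a minimal usage sketch of the proposed method. The kc.head(locator, num_retries=...) call matches the description above; the locator value is a made-up example, and KeepRequestError stands in for whichever exception the client raises when no service returns 200 OK:

    import arvados
    import arvados.errors

    kc = arvados.KeepClient(api_client=arvados.api('v1'))
    locator = 'acbd18db4cc2f85cedef654fccc4a4d8+3'  # hypothetical block locator

    try:
        # head() returns True as soon as one accessible service answers 200 OK.
        block_ok = kc.head(locator, num_retries=2)
    except arvados.errors.KeepRequestError:
        # Every service failed, even after retrying temporary errors.
        block_ok = False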

Update arv-put:

  • If the cache is otherwise usable (the file list is the same, the files are unchanged, etc.), use KeepClient's new head method to check the first block locator in the cached manifest. (A sketch of this check follows this list.)
    • The first will always be the oldest and most likely to fail, so it is the best one to check. We talked about potentially checking all of them, and that is more thorough, but it's also potentially much more expensive.
    • The rest of the cache checks are much cheaper than the HEAD request. We want to do this last, because if the cache is invalid for any other reason, we can notice and restart much faster.
  • If the head request confirms the block locator is still valid, continue from the cache as before.
  • Otherwise, invalidate the cache and start from scratch.
  • Test that arv-put behaves correctly:
    • In existing tests that arv-put resumes from a valid cache, update those tests to simulate a 200 OK response to the HEAD request. The rest of the tested behavior should be preserved as before.
    • Write a test to verify that arv-put invalidates the cache in a situation where the HEAD request fails.
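
A sketch of the resume-time check described above, assuming the cached state exposes the partial manifest text (the helper name and the parsing shortcut are illustrative, not the real arv-put internals):

    import arvados.errors

    def first_block_ok(keep_client, manifest_text, num_retries=0):
        # In a Keep manifest each line reads "<stream name> <locator>... <file token>...",
        # so the second whitespace-separated token of the first line is the oldest
        # block locator, the one whose permission signature expires first.
        try:
            locator = manifest_text.splitlines()[0].split()[1]
        except IndexError:
            return False  # empty or malformed manifest: don't trust the cache
        try:
            return keep_client.head(locator, num_retries=num_retries)
        except arvados.errors.KeepRequestError:
            return False  # signature expired or block unreadable: restart upload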

Subtasks 3 (0 open, 3 closed)

Task #12006: Review 8937-arvput-cached-tokens (Resolved, Peter Amstutz, 04/13/2016)
Task #8979: Review branch: 8937-head-request-in-python-keep-client (Resolved, Radhika Chippada, 04/13/2016)
Task #9041: Review branch 8937-arv-put-cache-check (Resolved, Radhika Chippada, 04/25/2016)

Related issues

Related to Arvados - Bug #8867: 'arv-put' reports 100% upload but files only partially uploaded (Resolved, 04/01/2016)
Related to Arvados - Feature #8993: arv-put: options for 3 modes of "resumption" (Closed, 04/14/2016)
#1

Updated by Brett Smith almost 8 years ago

  • Description updated (diff)
#2

Updated by Brett Smith almost 8 years ago

  • Target version set to Arvados Future Sprints
#3

Updated by Brett Smith almost 8 years ago

  • Description updated (diff)
  • Category set to SDKs
  • Story points set to 2.0
#4

Updated by Brett Smith almost 8 years ago

  • Assigned To set to Radhika Chippada
  • Target version changed from Arvados Future Sprints to 2016-04-27 sprint
#5

Updated by Radhika Chippada almost 8 years ago

  • Status changed from New to In Progress
#6

Updated by Peter Amstutz almost 8 years ago

Reviewing 20fc2f7

HEAD requests should ignore the cache. Revert the changes to KeepBlockCache and don't make any calls to self.block_cache when method == "HEAD".

In KeepService.get() it should be self._result.get('content-length') (the response), not self._headers.get('content-length') (the request).

KeepService.get() should return True on a successful HEAD, per the story description, not the content-length.
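
To make the intent concrete, here is a toy sketch of the requested control flow; the function and parameter names are stand-ins, not the real KeepService internals:

    def fetch(method, locator, block_cache, do_request):
        """GET returns block content (with caching); HEAD only confirms readability."""
        if method == "GET":
            cached = block_cache.get(locator)  # only GET consults the cache
            if cached is not None:
                return cached
        result = do_request(method, locator)   # dict of response metadata and body
        if method == "HEAD":
            # Report success as True; any size bookkeeping should come from the
            # response metadata (result), never from the request headers.
            return True
        block_cache[locator] = result['body']
        return result['body']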

#7

Updated by Peter Amstutz almost 8 years ago

ResumeCache already validates that the files being uploaded have probably not changed on disk, so there is nothing to do there.

ResumeCache.__init__ should do the following:

  • Call state = self.load()
  • If state["_finished_streams"] is not empty, get the first locator with locator = state["_finished_streams"][0][1][0]
  • Otherwise, if state["_finished_streams"] is empty, get the first locator with locator = state["_current_stream_locators"][0]
  • Create a client instance: keepclient = KeepClient()
  • Call keepclient.head(locator)

Add KeepRequestError to the exception handler around the ResumeCache to reset the cache:

        try:
            cachepath = ResumeCache.make_path(args)
            resume_cache = ResumeCache(cachepath)
        except (IOError, OSError, ValueError):
            pass  # Couldn't open cache directory/file.  Continue without it.
        except KeepRequestError:
            # delete the cache and create a new one
            shutil.rmtree(cachepath)
            resume_cache = ResumeCache(cachepath)
        except ResumeCacheConflict:
            print >>stderr, "\n".join([
                "arv-put: Another process is already uploading this data.",
                "         Use --no-resume if this is really what you want."])
            sys.exit(1)
#8

Updated by Radhika Chippada almost 8 years ago

  • Addressed comments from note 6
  • Branch 8937-arv-put-cache-check handles the updates to arv-put where a HEAD request is made on the first locator and the cache is recreated on errors (3fc755d67).
#9

Updated by Peter Amstutz almost 8 years ago

On review, I think we want to refactor a little bit; I apologize.

I suggest that the block in ResumeCache.__init__ go into its own method, check_cache():

    def check_cache(self):
        try:
            state = self.load()
            locator = None
            try:
                if "_finished_streams" in state and len(state["_finished_streams"]) > 0:
                    locator = state["_finished_streams"][0][1][0]
                elif "_current_stream_locators" in state and len(state["_current_stream_locators"]) > 0:
                    locator = state["_current_stream_locators"][0]
                if locator is not None:
                    kc = arvados.keep.KeepClient(api_client=api_client)
                    kc.head(locator, num_retries=num_retries)
            except Exception:
                # Any failure to confirm the first block invalidates the cache.
                self.restart()
        except ValueError:
            pass

Then, the block in main changes a little bit:

    if args.resume:
        try:
            cachepath = ResumeCache.make_path(args)
            resume_cache = ResumeCache(cachepath, api_client=api_client, num_retries=args.retries)
            resume_cache.check_cache()
        except (IOError, OSError, ValueError):
            pass  # Couldn't open cache directory/file.  Continue without it.
        except ResumeCacheConflict:
            print >>stderr, "\n".join([
                "arv-put: Another process is already uploading this data.",
                "         Use --no-resume if this is really what you want."])
            sys.exit(1)

The ResumeCache.restart() method takes care of managing the cache file correctly, which I overlooked in the original suggestion for implementation.

#10

Updated by Radhika Chippada almost 8 years ago

Commit 8820ca52 incorporates those suggestions. Thanks.

#11

Updated by Peter Amstutz almost 8 years ago

======================================================================
FAIL: test_ArvPutSignedManifest (tests.test_arv_put.ArvPutIntegrationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/src/arvados/sdk/python/tests/test_arv_put.py", line 575, in test_ArvPutSignedManifest
    self.assertEqual(p.returncode, 0)
AssertionError: 1 != 0

======================================================================
FAIL: test_put_collection_with_name_and_no_project (tests.test_arv_put.ArvPutIntegrationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/src/arvados/sdk/python/tests/test_arv_put.py", line 625, in test_put_collection_with_name_and_no_project
    ['--portable-data-hash', '--name', link_name])
  File "/usr/src/arvados/sdk/python/tests/test_arv_put.py", line 598, in run_and_find_collection
    self.assertEqual(1, len(collection_list))
AssertionError: 1 != 0

======================================================================
FAIL: test_put_collection_with_named_project_link (tests.test_arv_put.ArvPutIntegrationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/src/arvados/sdk/python/tests/test_arv_put.py", line 635, in test_put_collection_with_named_project_link
    '--project-uuid', self.PROJECT_UUID])
  File "/usr/src/arvados/sdk/python/tests/test_arv_put.py", line 598, in run_and_find_collection
    self.assertEqual(1, len(collection_list))
AssertionError: 1 != 0

======================================================================
FAIL: test_put_collection_with_unnamed_project_link (tests.test_arv_put.ArvPutIntegrationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/src/arvados/sdk/python/tests/test_arv_put.py", line 615, in test_put_collection_with_unnamed_project_link
    ['--portable-data-hash', '--project-uuid', self.PROJECT_UUID])
  File "/usr/src/arvados/sdk/python/tests/test_arv_put.py", line 598, in run_and_find_collection
    self.assertEqual(1, len(collection_list))
AssertionError: 1 != 0
#12

Updated by Peter Amstutz almost 8 years ago

$ arvbox restart test --only sdk/python  sdk/python_test=--test-suite=tests.test_arv_put.ArvPutIntegrationTest.test_ArvPutSignedManifest
...
[keep0] 2016/04/27 16:33:33 keepstore starting, pid 1034
[keep0] 2016/04/27 16:33:33 Using volume [UnixVolume /tmp/tmpUuhWG8] (writable=true)
[keep0] 2016/04/27 16:33:33 never-delete is not set. Warning: the relevant features in keepstore and data manager have not been extensively tested. You should leave this option alone unless you can afford to lose data.
[keep0] 2016/04/27 16:33:33 listening at :52737
[keep1] 2016/04/27 16:33:34 keepstore starting, pid 1042
[keep1] 2016/04/27 16:33:34 Using volume [UnixVolume /tmp/tmpxnkxp3] (writable=true)
[keep1] 2016/04/27 16:33:34 never-delete is not set. Warning: the relevant features in keepstore and data manager have not been extensively tested. You should leave this option alone unless you can afford to lose data.
[keep1] 2016/04/27 16:33:34 listening at :42993
0M / 0M 0.0% [keep1] 2016/04/27 16:33:34 [[::1]:55496] PUT 08a008a01d498c404b0c30852b39d3b8 44 0.000962s 0.000956s 0.000006s 200 87 "OK" 
[keep0] 2016/04/27 16:33:34 [[::1]:40156] PUT 08a008a01d498c404b0c30852b39d3b8 44 0.000863s 0.000857s 0.000006s 200 87 "OK" 
0M / 0M 100.0% 
arv-put: Error creating Collection on project: <HttpError 403 when requesting https://0.0.0.0:43382/arvados/v1/collections?ensure_unique_name=true&alt=json returned "#<ArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError>">.
Traceback (most recent call last):
  File "/usr/src/arvados/sdk/python/arvados/commands/put.py", line 567, in <module>
    main()
  File "/usr/src/arvados/sdk/python/arvados/commands/put.py", line 551, in main
    stdout.write(output)
UnboundLocalError: local variable 'output' referenced before assignment
FAIL

Not sure why it's getting a permission denied error, but that might be what's going on here.

#13

Updated by Radhika Chippada almost 8 years ago

  • Target version changed from 2016-04-27 sprint to 2016-05-11 sprint
  • Story points changed from 2.0 to 0.5

Merged code into master in e62a18f3 and moving to the next sprint to add the integration test with 0.5 story points.

#14

Updated by Peter Amstutz almost 8 years ago

Here's an empty example of what the cache file looks like:

{
  "_current_stream_files": [],
  "_current_stream_length": 0,
  "_current_stream_locators": [],
  "_current_stream_name": ".",
  "_current_file": None,
  "_current_file_name": null,
  "_current_file_pos": 0,
  "_close_file": null,
  "_data_buffer": "",
  "_dependencies": {},
  "_finished_streams": [],
  "_queued_dirents": [],
  "_queued_trees": []
}
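
Given that structure, pulling the first block locator out of a populated cache file is straightforward. A sketch (the cache file path is hypothetical):

    import json

    with open('/tmp/example-arv-put-cache.json') as f:  # hypothetical path
        state = json.load(f)

    if state["_finished_streams"]:
        # Each finished stream is (name, [locators], ...): take the first locator.
        locator = state["_finished_streams"][0][1][0]
    elif state["_current_stream_locators"]:
        locator = state["_current_stream_locators"][0]
    else:
        locator = None  # nothing uploaded yet
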
#15

Updated by Brett Smith almost 8 years ago

  • Subject changed from [SDKs] When arv-put resumes, it does a HEAD request on the first block, and invalidates the cache if that request fails to [SDKs] Write integration test for when arv-put resumes from a cache with expired access tokens
  • Status changed from In Progress to New
  • Assigned To deleted (Radhika Chippada)
  • Target version changed from 2016-05-11 sprint to Arvados Future Sprints
#16

Updated by Tom Morris over 6 years ago

  • Assigned To set to Lucas Di Pentima
  • Target version changed from Arvados Future Sprints to 2017-07-19 sprint
#17

Updated by Lucas Di Pentima over 6 years ago

  • Target version changed from 2017-07-19 sprint to 2017-08-02 sprint
#18

Updated by Lucas Di Pentima over 6 years ago

  • Target version changed from 2017-08-02 sprint to 2017-08-16 sprint
#19

Updated by Lucas Di Pentima over 6 years ago

  • Status changed from New to In Progress
#20

Updated by Lucas Di Pentima over 6 years ago

Updates at 7b93dc26e
Test run: https://ci.curoverse.com/job/developer-run-tests/409/

  • Updated arv-put code to do a cache file validation before loading. It checks the cached manifest's first signed block locator against Keep using the HEAD method.
  • Added related integration test.
#21

Updated by Peter Amstutz over 6 years ago

Looking at _cache_is_valid:

  • Instead of splitting the manifest into lines, could you use re.MULTILINE to find the first token? (A sketch follows this list.)
  • KeepClient.head() will raise an exception if the block is inaccessible. The log message "Something wrong happened" is a little vague and misleading since this is an expected error. Consider catching KeepRequestError and printing a more precise error message.
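
A possible shape for that suggestion (an illustrative sketch, not the committed code): a signed locator is 32 hex digits, a size hint, and a +A permission signature, and re.MULTILINE lets ^ anchor at the start of each manifest line.

    import re

    # ^ matches the start of each line under re.MULTILINE, so this finds the
    # first signed locator following the first stream name in the manifest.
    SIGNED_LOCATOR = re.compile(r'^\S+ ([0-9a-f]{32}\+\d+\+A\S+)', re.MULTILINE)

    def first_signed_locator(manifest_text):
        match = SIGNED_LOCATOR.search(manifest_text)
        return match.group(1) if match else None
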
#22

Updated by Lucas Di Pentima over 6 years ago

Updates at 7049888ba
Test run: https://ci.curoverse.com/job/developer-run-tests/413/

Made suggested changes.
Catching the specific KeepRequestError exception revealed some additional cases where other exceptions don't mean an invalid cache file.

#23

Updated by Peter Amstutz over 6 years ago

Lucas Di Pentima wrote:

> Updates at 7049888ba
> Test run: https://ci.curoverse.com/job/developer-run-tests/413/
>
> Made suggested changes.
> Catching the specific KeepRequestError exception revealed some additional cases where other exceptions don't mean an invalid cache file.

It seems weird that if the manifest is empty or invalid (lacks any signed blocks) it would still be treated as a "valid cache". Seems like the logic should be flipped around: the only case where you want to use the cache is when it is both well formed and the signatures are valid.
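
In other words, a sketch of the flipped logic (reusing the hypothetical first_signed_locator helper from note 21):

    import arvados.errors

    def cache_is_valid(keep_client, manifest_text, num_retries=0):
        locator = first_signed_locator(manifest_text)
        if locator is None:
            return False  # empty or unsigned manifest: never trust the cache
        try:
            return keep_client.head(locator, num_retries=num_retries)
        except arvados.errors.KeepRequestError:
            return False  # signatures no longer valid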

#24

Updated by Lucas Di Pentima over 6 years ago

Updates at 81352ba32
Test run: https://ci.curoverse.com/job/developer-run-tests/414/

Realized that as part of #10383 we already added a token expiration check (see https://github.com/curoverse/arvados/blob/master/sdk/python/arvados/commands/put.py#L710), so I removed the _cache_is_valid() method, and updated the integration test to work with the already existing code.
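
For reference, that kind of expiration check can run locally, because the +A permission hint in a signed locator ends with a hexadecimal Unix timestamp after the @ sign. A sketch (not the code at the link above):

    import re
    import time

    def signature_expired(locator):
        match = re.search(r'\+A[0-9a-f]+@([0-9a-f]{8})', locator)
        if match is None:
            return True  # unsigned locator: treat as unusable
        return int(match.group(1), 16) <= time.time()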

#25

Updated by Peter Amstutz over 6 years ago

LGTM

#26

Updated by Anonymous over 6 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:51fe3482a5253d054e838fb1dcf3982f79650cee.
