Idea #8937
Closed
[SDKs] Write integration test for when arv-put resumes from a cache with expired access tokens
Description
arv-put does not check the validity of the permission signatures on its blocks when it resumes from cache. When it resumes from a cache with invalid permission signatures, it continues uploading blocks, but the collection creation at the end of the process fails.
Update the Python SDK:
- Add a head() method to the KeepClient class. It takes a block locator and num_retries as arguments. It sends a HEAD request for that block to all accessible Keep services in the usual rendezvous order. It returns True if one service returns 200 OK, and raises an exception if it never gets that response. It retries following the same logic as the get() method.
- Add tests for this method that check behavior when:
- One accessible service returns 200 OK.
- The last accessible service returns 200 OK.
- All accessible services return a permanent error.
- All accessible services return an error. Some of those errors are temporary. On a second request, at least one of the services returns 200 OK.
- All accessible services return temporary errors, enough to exhaust the number of retries.
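The head() contract described above can be sketched with stub services, each of which returns an HTTP status code. All names here are illustrative, not the SDK's actual internals; the real method lives on the Arvados KeepClient class:

```python
class KeepRequestError(Exception):
    """Stand-in for the SDK exception raised when no service returns 200."""

def head(services, locator, num_retries=0):
    # Try every accessible service (in rendezvous order) on each attempt;
    # succeed as soon as any service answers 200 OK.
    for attempt in range(num_retries + 1):
        saw_temporary = False
        for svc in services:
            status = svc(locator)
            if status == 200:
                return True
            if status in (408, 429) or status >= 500:
                saw_temporary = True  # worth retrying, like get()
        if not saw_temporary:
            break  # only permanent errors; more attempts won't help
    raise KeepRequestError("HEAD of %s failed on all services" % locator)

# First service fails permanently, second succeeds: head() returns True.
ok = head([lambda loc: 404, lambda loc: 200],
          "08a008a01d498c404b0c30852b39d3b8+44")
```

The stub distinguishes permanent errors (retrying cannot help, so stop early) from temporary ones (retry up to num_retries times), matching the test cases listed above.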
Update arv-put:
- If the cache is otherwise usable (the file list is the same, the files are unchanged, etc.), use KeepClient's new head method to check the first block locator in the manifest in the cache.
- The first will always be the oldest and most likely to fail, so it is the best one to check. We talked about potentially checking all of them, and that is more thorough, but it's also potentially much more expensive.
- The rest of the cache checks are much cheaper than the HEAD request. We want to do this last, because if the cache is invalid for any other reason, we can notice and restart much faster.
- If the head request confirms the block locator is still valid, continue from the cache as before.
- Otherwise, invalidate the cache and start from scratch.
- Test that arv-put behaves correctly:
- In existing tests that arv-put resumes from a valid cache, update those tests to simulate a 200 OK response to the HEAD request. The rest of the tested behavior should be preserved as before.
- Write a test to verify that arv-put invalidates the cache in a situation where the HEAD request fails.
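The resume decision described above can be sketched as follows. All names are illustrative: `keep_head` is a stand-in for the new KeepClient head method, and `KeepRequestError` is a local stand-in for the SDK's exception.

```python
class KeepRequestError(Exception):
    """Stand-in for the SDK exception raised when a HEAD request fails."""

def can_resume(cheap_checks, first_block_locator, keep_head):
    # Cheap checks first (same file list, unchanged files, ...): if any
    # fails, we restart immediately without paying for a network request.
    for check in cheap_checks:
        if not check():
            return False
    # The first locator is the oldest and most likely to have expired,
    # so it is the cheapest meaningful HEAD probe.
    try:
        return keep_head(first_block_locator)
    except KeepRequestError:
        return False  # expired signatures: invalidate and start over

# Usage with stubs: all cheap checks pass and Keep confirms the block,
# so resuming is allowed.
ok = can_resume([lambda: True], "fakelocator+0", lambda loc: True)
```

Ordering the HEAD request last reflects the reasoning above: every other check is cheaper, so any other cache problem is detected before the network round trip.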
Related issues
Updated by Brett Smith over 8 years ago
- Target version set to Arvados Future Sprints
Updated by Brett Smith over 8 years ago
- Description updated (diff)
- Category set to SDKs
- Story points set to 2.0
Updated by Brett Smith over 8 years ago
- Assigned To set to Radhika Chippada
- Target version changed from Arvados Future Sprints to 2016-04-27 sprint
Updated by Radhika Chippada over 8 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 8 years ago
Reviewing 20fc2f7
- HEAD requests should ignore the cache. Revert the changes to KeepBlockCache and don't make any calls to self.block_cache when method == "HEAD".
- In KeepService.get(), it should be self._result.get('content-length') (the response), not self._headers.get('content-length') (the request).
- KeepService.get() should return True on a successful HEAD, per the story description, not the content-length.
Updated by Peter Amstutz over 8 years ago
ResumeCache already validates that the files being uploaded have probably not changed on disk, so there is nothing to do there.
ResumeCache.__init__ should do the following:
- Call state = self.load()
- If state["_finished_streams"] is not empty, get the first locator with locator = state["_finished_streams"][0].split(' ')[1]
- Otherwise, if state["_finished_streams"] is empty, get the first locator with locator = state["_current_stream_locators"][0] (it is a list of locators, so no split is needed)
- Create an instance: keepclient = KeepClient()
- Call keepclient.head(locator)
Add KeepRequestError to the exception handler around the ResumeCache to reset the cache:
```python
try:
    cachepath = ResumeCache.make_path(args)
    resume_cache = ResumeCache(cachepath)
except (IOError, OSError, ValueError):
    pass  # Couldn't open cache directory/file. Continue without it.
except KeepRequestError:
    # Delete the cache and create a new one.
    shutil.rmtree(cachepath)
    resume_cache = ResumeCache(cachepath)
except ResumeCacheConflict:
    print >>stderr, "\n".join([
        "arv-put: Another process is already uploading this data.",
        "         Use --no-resume if this is really what you want."])
    sys.exit(1)
```
Updated by Radhika Chippada over 8 years ago
- Addressed comments from note 6
- Branch 8937-arv-put-cache-check handles the updates to arv-put where a head request is made on the first locator and the cache is recreated on errors. 3fc755d67
Updated by Peter Amstutz over 8 years ago
On review, I think we want to refactor a little bit, I apologize.
I suggest that the block in ResumeCache.__init__ goes into its own method, check_cache():

```python
def check_cache(self):
    try:
        state = self.load()
        locator = None
        try:
            if "_finished_streams" in state and len(state["_finished_streams"]) > 0:
                locator = state["_finished_streams"][0][1][0]
            elif "_current_stream_locators" in state and len(state["_current_stream_locators"]) > 0:
                locator = state["_current_stream_locators"][0]
            if locator is not None:
                kc = arvados.keep.KeepClient(api_client=api_client)
                kc.head(locator, num_retries=num_retries)
        except Exception as e:
            self.restart()
    except ValueError:
        pass
```
Then, the block in main changes a little bit:
```python
if args.resume:
    try:
        cachepath = ResumeCache.make_path(args)
        resume_cache = ResumeCache(cachepath, api_client=api_client,
                                   num_retries=args.retries)
        resume_cache.check_cache()
    except (IOError, OSError, ValueError):
        pass  # Couldn't open cache directory/file. Continue without it.
    except ResumeCacheConflict:
        print >>stderr, "\n".join([
            "arv-put: Another process is already uploading this data.",
            "         Use --no-resume if this is really what you want."])
        sys.exit(1)
```
The ResumeCache.restart() method takes care of managing the cache file correctly, which I overlooked in the original suggestion for implementation.
Updated by Radhika Chippada over 8 years ago
Commit 8820ca52 incorporates those suggestions. Thanks.
Updated by Peter Amstutz over 8 years ago
```
======================================================================
FAIL: test_ArvPutSignedManifest (tests.test_arv_put.ArvPutIntegrationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/src/arvados/sdk/python/tests/test_arv_put.py", line 575, in test_ArvPutSignedManifest
    self.assertEqual(p.returncode, 0)
AssertionError: 1 != 0

======================================================================
FAIL: test_put_collection_with_name_and_no_project (tests.test_arv_put.ArvPutIntegrationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/src/arvados/sdk/python/tests/test_arv_put.py", line 625, in test_put_collection_with_name_and_no_project
    ['--portable-data-hash', '--name', link_name])
  File "/usr/src/arvados/sdk/python/tests/test_arv_put.py", line 598, in run_and_find_collection
    self.assertEqual(1, len(collection_list))
AssertionError: 1 != 0

======================================================================
FAIL: test_put_collection_with_named_project_link (tests.test_arv_put.ArvPutIntegrationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/src/arvados/sdk/python/tests/test_arv_put.py", line 635, in test_put_collection_with_named_project_link
    '--project-uuid', self.PROJECT_UUID])
  File "/usr/src/arvados/sdk/python/tests/test_arv_put.py", line 598, in run_and_find_collection
    self.assertEqual(1, len(collection_list))
AssertionError: 1 != 0

======================================================================
FAIL: test_put_collection_with_unnamed_project_link (tests.test_arv_put.ArvPutIntegrationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/src/arvados/sdk/python/tests/test_arv_put.py", line 615, in test_put_collection_with_unnamed_project_link
    ['--portable-data-hash', '--project-uuid', self.PROJECT_UUID])
  File "/usr/src/arvados/sdk/python/tests/test_arv_put.py", line 598, in run_and_find_collection
    self.assertEqual(1, len(collection_list))
AssertionError: 1 != 0
```
Updated by Peter Amstutz over 8 years ago
```
$ arvbox restart test --only sdk/python sdk/python_test=--test-suite=tests.test_arv_put.ArvPutIntegrationTest.test_ArvPutSignedManifest
...
[keep0] 2016/04/27 16:33:33 keepstore starting, pid 1034
[keep0] 2016/04/27 16:33:33 Using volume [UnixVolume /tmp/tmpUuhWG8] (writable=true)
[keep0] 2016/04/27 16:33:33 never-delete is not set. Warning: the relevant features in keepstore and data manager have not been extensively tested. You should leave this option alone unless you can afford to lose data.
[keep0] 2016/04/27 16:33:33 listening at :52737
[keep1] 2016/04/27 16:33:34 keepstore starting, pid 1042
[keep1] 2016/04/27 16:33:34 Using volume [UnixVolume /tmp/tmpxnkxp3] (writable=true)
[keep1] 2016/04/27 16:33:34 never-delete is not set. Warning: the relevant features in keepstore and data manager have not been extensively tested. You should leave this option alone unless you can afford to lose data.
[keep1] 2016/04/27 16:33:34 listening at :42993
0M / 0M 0.0%
[keep1] 2016/04/27 16:33:34 [[::1]:55496] PUT 08a008a01d498c404b0c30852b39d3b8 44 0.000962s 0.000956s 0.000006s 200 87 "OK"
[keep0] 2016/04/27 16:33:34 [[::1]:40156] PUT 08a008a01d498c404b0c30852b39d3b8 44 0.000863s 0.000857s 0.000006s 200 87 "OK"
0M / 0M 100.0%
arv-put: Error creating Collection on project: <HttpError 403 when requesting https://0.0.0.0:43382/arvados/v1/collections?ensure_unique_name=true&alt=json returned "#<ArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError>">.
Traceback (most recent call last):
  File "/usr/src/arvados/sdk/python/arvados/commands/put.py", line 567, in <module>
    main()
  File "/usr/src/arvados/sdk/python/arvados/commands/put.py", line 551, in main
    stdout.write(output)
UnboundLocalError: local variable 'output' referenced before assignment
FAIL
```
Not sure why it's getting a permission denied error, but that might be what's going on here.
Updated by Radhika Chippada over 8 years ago
- Target version changed from 2016-04-27 sprint to 2016-05-11 sprint
- Story points changed from 2.0 to 0.5
Merged code into master in e62a18f3 and moving to next sprint to add the integration test with 0.5 story points.
Updated by Peter Amstutz over 8 years ago
Here's an empty example of what the cache file looks like:
```json
{
  "_current_stream_files": [],
  "_current_stream_length": 0,
  "_current_stream_locators": [],
  "_current_stream_name": ".",
  "_current_file": null,
  "_current_file_name": null,
  "_current_file_pos": 0,
  "_close_file": null,
  "_data_buffer": "",
  "_dependencies": {},
  "_finished_streams": [],
  "_queued_dirents": [],
  "_queued_trees": []
}
```
Updated by Brett Smith over 8 years ago
- Subject changed from [SDKs] When arv-put resumes, it does a HEAD request on the first block, and invalidates the cache if that request fails to [SDKs] Write integration test for when arv-put resumes from a cache with expired access tokens
- Status changed from In Progress to New
- Assigned To deleted (Radhika Chippada)
- Target version changed from 2016-05-11 sprint to Arvados Future Sprints
Updated by Tom Morris over 7 years ago
- Assigned To set to Lucas Di Pentima
- Target version changed from Arvados Future Sprints to 2017-07-19 sprint
Updated by Lucas Di Pentima over 7 years ago
- Target version changed from 2017-07-19 sprint to 2017-08-02 sprint
Updated by Lucas Di Pentima over 7 years ago
- Target version changed from 2017-08-02 sprint to 2017-08-16 sprint
Updated by Lucas Di Pentima over 7 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima about 7 years ago
Updates at 7b93dc26e
Test run: https://ci.curoverse.com/job/developer-run-tests/409/
- Updated arv-put code to do a cache file validation before loading. It checks the cached manifest's first signed block locator against Keep using the HEAD method.
- Added related integration test.
Updated by Peter Amstutz about 7 years ago
Looking at _cache_is_valid:
- Instead of splitting the manifest into lines, could you use re.MULTILINE to find the first token?
- KeepClient.head() will raise an exception if the block is inaccessible. The log message "Something wrong happened" is a little vague and misleading since this is an expected error. Consider catching KeepRequestError and printing a more precise error message.
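Both suggestions above can be sketched together. The helper name `first_signed_locator` is hypothetical, the locator in the sample manifest line is made up, and `KeepRequestError` is a local stand-in for `arvados.errors.KeepRequestError`:

```python
import re

# A manifest line starts with a stream name followed by block locators;
# with re.MULTILINE, ^ matches at each line start, so one search finds
# the first locator without splitting the manifest into lines.
SIGNED_LOCATOR_RE = re.compile(r'^\S+ ([0-9a-f]{32}\+\S+)', re.MULTILINE)

def first_signed_locator(manifest_text):
    match = SIGNED_LOCATOR_RE.search(manifest_text)
    return match.group(1) if match else None

class KeepRequestError(Exception):
    """Stand-in for arvados.errors.KeepRequestError."""

def cache_is_valid(manifest_text, keep_head):
    # keep_head stands in for KeepClient.head, which raises on failure.
    loc = first_signed_locator(manifest_text)
    if loc is None:
        return False  # no signed blocks: nothing in the cache to trust
    try:
        return keep_head(loc)
    except KeepRequestError as e:
        # Expected when signatures have expired; report it precisely
        # instead of a vague "Something wrong happened".
        print("arv-put: cached block no longer readable: %s" % e)
        return False

manifest = ". 08a008a01d498c404b0c30852b39d3b8+44+Adeadbeef@12345678 0:44:foo.txt\n"
loc = first_signed_locator(manifest)
```

Catching only KeepRequestError keeps unrelated exceptions (a malformed cache file, a coding error) from being silently treated as expired signatures.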
Updated by Lucas Di Pentima about 7 years ago
Updates at 7049888ba
Test run: https://ci.curoverse.com/job/developer-run-tests/413/
Made suggested changes.
Catching the specific KeepRequestError exception revealed some additional cases where other exceptions don't mean an invalid cache file.
Updated by Peter Amstutz about 7 years ago
Lucas Di Pentima wrote:
> Updates at 7049888ba
> Test run: https://ci.curoverse.com/job/developer-run-tests/413/
> Made suggested changes.
> Catching the specific KeepRequestError exception revealed some additional cases where other exceptions don't mean an invalid cache file.
It seems weird that if the manifest is empty or invalid (lacks any signed blocks), it would still be treated as a "valid cache". The logic should be flipped around: the only case where you want to use the cache is when it is both well formed and the signatures are valid.
Updated by Lucas Di Pentima about 7 years ago
Updates at 81352ba32
Test run: https://ci.curoverse.com/job/developer-run-tests/414/
Realized that as part of #10383 we already added a token expiration check (see https://github.com/curoverse/arvados/blob/master/sdk/python/arvados/commands/put.py#L710), so I removed the _cache_is_valid() method and updated the integration test to work with the already existing code.
Updated by Anonymous about 7 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:51fe3482a5253d054e838fb1dcf3982f79650cee.