Project

General

Profile

Actions

Bug #6833

closed

[FUSE] arv-mount caches manifests too long, ends up using expired Keep signatures

Added by Tom Clegg over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
FUSE
Target version:
Story points:
0.5

Description

Caught a VM getting 401 like this:

2015-07-30_21:20:00.75039 KeepReadError: failed to read 2e070b0bd2d5c3ca62ac8b63e7fb7d17+66595+A73d20470a444372d5120f3bc8a8e25c846576408@55b01cd1: s
ervice http://keep7.su92l.arvadosapi.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized

That block signature expiry time ("55b01cd1") is ~7 days earlier than the log message. Same request worked fine after restarting arv-mount.

A general solution would be ideal, but I think it should be easy to make arv-mount handle this common case: when checking for a cached copy of a manifest, it should check the expiry times. If any are in the past (or in the very near future) it should replaced the cached manifest with a new one retrieved from the API server.


Subtasks 3 (0 open3 closed)

Task #8279: Review 6833-test-token-expiryResolvedTom Clegg07/30/2015Actions
Task #8276: Find out whether 6833-arv-mount-cache-refresh makes the test in 6833-test-token-expiry passResolvedPeter Amstutz07/30/2015Actions
Task #8127: Review branch: 6833-arv-mount-cache-refreshResolvedPeter Amstutz07/30/2015Actions

Related issues

Related to Arvados - Bug #6771: Getting 'Input/output error' on su92l keep mountClosed07/24/2015Actions
Related to Arvados - Bug #10008: [SDKs] [Python] When reading data through Collection et al., signatures should refresh automatically when neededNewActions
Has duplicate Arvados - Bug #7866: arv-mount not loading all data on qr1hi Duplicate11/25/2015Actions
Actions #1

Updated by Brett Smith over 8 years ago

  • Story points changed from 0.5 to 2.0
Actions #2

Updated by Brett Smith over 8 years ago

  • Assigned To set to Radhika Chippada
  • Target version changed from Arvados Future Sprints to 2016-01-20 Sprint
Actions #3

Updated by Peter Amstutz over 8 years ago

I think this can be as simple as, in CollectionDirectory.__init__:

self.poll = True
self.poll_time = 60 * 60

This should cause it to refresh at minimum every hour.

A slightly more clever solution would be to look up the expiry time in the discovery document and divide by half.

Actions #4

Updated by Radhika Chippada over 8 years ago

  • Status changed from New to In Progress
Actions #5

Updated by Brett Smith over 8 years ago

Peter Amstutz wrote:

I think this can be as simple as, in CollectionDirectory.__init__:

self.poll = True
self.poll_time = 60 * 60

This basic approach seems sound, but this will still have the original bug if the TTL is an hour or less, which I could imagine happening in very security-sensitive deployments that want to be able to revoke access quickly. Could we do something like self.poll_time = ttl_from_discovery_document / 2, with some bounds checking and all that?

Actions #6

Updated by Radhika Chippada over 8 years ago

Branch 6833-arv-mount-cache-refresh retrieves blobSignatureTtl from discovery document and uses it to set self.poll_time.

Actions #7

Updated by Tom Clegg over 8 years ago

6833-test-token-expiry ef5692e0 has a failing test case for this. If you fill in this part:
  •         # TODO: mock something in arvados_fuse here so it thinks
            # manifests/signatures expire in 1 second
    

...then it should reproduce the bug: i.e., the manifest is fetched, then time passes, then fuse attempts to read data using the old signatures.

A more realistic test would be to mock the collections API so the signatures really do expire in 1 second. However, since the proposed fix involves looking at the TTL advertised in the discovery doc and ignoring the signatures themselves, mocking the signatures wouldn't be enough to trigger the fix, so we might as well just mock the discovery doc and ensure fuse retrieves fresh (now+2w) signatures from the API server.

Actions #8

Updated by Radhika Chippada over 8 years ago

I am having an issue using the strategy from ef5692e0 in writing a test for this update.

The issue is: the test uses local_store and kernel caching for the fuse mount. Due to this I am not able to write a test that does something like: listdir, read file; sleep ttl; read file where token is expired.

The other alternative is to write a test with steps: listdir, sleep ttl, read file where token is expired. However, in ef5692e0, the _test_refresh_old_manifest method is being executed even before any other test steps are executed (may be because it is a @staticmethod). It would help if I can figure out how to make this fire when get is actually invoked, rather that at initialization of test.

Actions #9

Updated by Peter Amstutz over 8 years ago

  • Target version changed from 2016-01-20 Sprint to 2016-02-03 Sprint
Actions #10

Updated by Peter Amstutz over 8 years ago

  • Assigned To changed from Radhika Chippada to Peter Amstutz
Actions #11

Updated by Peter Amstutz over 8 years ago

  • Story points changed from 2.0 to 0.5
Actions #12

Updated by Tom Clegg over 8 years ago

Reviewing 6833-test-token-expiry @ 8a2645e (the parts I didn't write)

Please rebase this to master, and mock the discovery document TTL instead of the poll_time attribute of the mount. As written, the test passes even though the bug is still present. For example, I'm noticing the bugfix says things about self.poll, not self._poll, and ProjectDirectory's constructor looks like it will overwrite whatever CollectionDirectory puts in self._poll.

Actions #13

Updated by Tom Clegg over 8 years ago

f82d809 LGTM

Re MountTestBase... the main improvements in IntegrationTest over MountTestBase are
  • it uses the real fuse setup code, so you test the way the mounts get set up in real life, which is a bit hairy
  • you can make a class with a bunch of tests (one test_* and one matching static test* method) instead of having to make a class and a lone function outside the class for each test

It also uses a non-admin user by default -- right now the tests are pretty heavy on the "how well things work for admins" side.

Actions #14

Updated by Peter Amstutz over 8 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100
Actions

Also available in: Atom PDF