Idea #3706
closed[SDKs] Remove fallback-to-keep warning from python SDK if block hash has a permission signature
Description
Background: Crunch job tasks can use Keep as a cache for manifest fragments, which crunch-job retrieves at the end of the job using arv-get {block-locator}
. The arv-get
program does not know at this point whether the supplied block-locator
is a reference to a block to retrieve from Keep or a collection to retrieve from the API server.
Sat Sep 27 02:00:42 2014 2014-09-27 02:00:41 arvados.collection[7714] WARNING: \ API server did not return Collection '4de3386d281fec68fdc91c31d5ffd3fe+51502+Axxx@54380491'. Trying to fetch directly from Keep (deprecated).
- If
arv-get
is asked to retrieve something that looks like a block hash and has a permission signature hint, it should try fetching it from Keep first, then issue a warning and fall back on API server. - If
arv-get
is asked to retrieve something that looks like a block hash and has no permission signature hint, it should try fetching it from API server, then fall back on Keep. If both fail, the warning should indicate that both methods were attempted and both failed. The existing warning ("plan A failed, trying plan B now") should be removed, i.e., if Keep succeeds, no warning is printed.- This scenario is useful only when keepstore services are configured to permit anonymous unauthenticated read access. Otherwise, it wastes an HTTP request but is otherwise harmless.
Updated by Tom Clegg about 10 years ago
- Subject changed from Remove fallback-to-keep code from arv-mount to Remove fallback-to-keep warning from python SDK if block hash has a permission signature
- Assigned To set to Tom Clegg
Updated by Tom Clegg about 10 years ago
- Subject changed from Remove fallback-to-keep warning from python SDK if block hash has a permission signature to [SDKs] Remove fallback-to-keep warning from python SDK if block hash has a permission signature
- Description updated (diff)
Updated by Tom Clegg about 10 years ago
- Tracker changed from Task to Idea
- Target version changed from 2014-10-08 sprint to Arvados Future Sprints
- Story points set to 0.5
Updated by Ward Vandewege about 10 years ago
- Target version changed from Arvados Future Sprints to 2014-10-29 sprint
Updated by Brett Smith about 10 years ago
- Assigned To changed from Tom Clegg to Brett Smith
Stealing the ticket because it's easy and I'm working in this space anyway for #4248.
Updated by Brett Smith about 10 years ago
- Assigned To changed from Brett Smith to Tom Clegg
Un-stealing this ticket because #4248 isn't even a bug, so I'm not working on this anymore (although working on the code helped me realize my mistake). Sorry for the noise.
Updated by Tom Clegg about 10 years ago
- Target version changed from 2014-10-29 sprint to 2014-11-19 sprint
Updated by Tom Clegg about 10 years ago
"arv-get {manifesthash} -"
makes a CollectionReader and spits out its manifest_text(). This could (should?) be done differently -- this form is handy for hashes of non-manifest data blocks too, which we shouldn't be [coming so close to] trying to parse -- but in the meantime at least it's easy to make CollectionReader return the manifest just as it was retrieved.- The --max-manifest-depth argument was ineffective. We actually had tests confirming that max_manifest_depth was being ignored!
- Generally, CollectionReader and CollectionWriter would accept A and behave as if they had been given B -- hashes don't match, files/streams are reordered, etc. -- and there was no way for the caller to avoid this short of re-implementing them.
I fixed this by adding a normalize() method, and calling it from arv-normalize and (when --normalize is given) arv-put. (The arv-put usage message mentions that --normalize makes --max-manifest-depth ineffective.)
Incidentally, normalize() was masking a CollectionWriter bug where calling finish_current_file()
followed by finish()
or manifest_text()
would produce an extra zero-length file with the same name as the previous file.
- CollectionReader's stripped_manifest() method was (incorrectly) stripping only permission hints. It was also stripping +A[stuff] from any filenames (but the last file in the stream was safe). Meanwhile, CollectionWriter's
manifest_text(strip=True)
path was (correctly) stripping all non-size hints, using Stream's "stripped_manifest" method. Now, both CollectionReader and CollectionWriter use a commonstripped_manifest()
method which doesn't have those two bugs. - I moved some regular expressions from CollectionReader to the util package.
I noticed that the Keep client's error message, when it can't find a block, is just the block hash it was looking for. This is a bit confusing and not especially helpful, but I decided not to address that here too. The "normalize" adventure was quite enough.
Updated by Brett Smith about 10 years ago
First, mea culpa on the bad tests for max_manifest_depth. I wrote those when I was adding retry support to arv-put, which was my first big foray into Keep. I think I saw how CollectionWriter put files above the limit in the same block, and figured that was the intended behavior. This is an excellent lesson in the dangers of writing tests after the fact.
Main comments on 0e8d9b4:
- Would it make more sense to move the populate try/except work into the individual _populate methods? They could return any exception raised, or else None. Then the main _populate can just say
error_via_keep = self._populate_from_keep()
. It would at least cut down on some redundant code. - Two things about
except e
(although this might just be a typo). First, this does not do what you might think: rather than catching every exception and saving it toe
, it tries to match the raised exception againste
. For example,e
could be a tuple of exception classes defined at runtime. This code as-is will raise a NameError trying to finde
if an exception is raised.
Second, a bareexcept
in Python is almost always a mistake, because it will catch BaseException, which includes KeyboardInterrupt, MemoryError, and other exceptions you almost certainly don't want to catch. Yes, it's unfortunate that this is the exact opposite of the situation in Ruby. Them's the breaks.
Preferexcept Exception as e
, as elsewhere in this code. - Having normalize()
return self
is really un-Pythonic. Methods that primarily mutate internal state conventionally return None (a.k.a. don't have a return) to signal exactly this to the caller. If we want our SDK to feel familiar to people familiar with Python, we should follow suit.
What if we added anormalize=False
argument to manifest_text()? That seems to be the primary use case. - Since
--normalize
and--max-manifest-depth
are incompatible, let's put them into a mutually_exclusive_group in the parser, soarv-put
will generate a nice error message if the user specifies both. - I'm not sure I see where the API server is mocked to return the wrong data in
test_try_keep_first_if_permission_hint
. e.g., you could callmock_get_collection
with a collection other thanfoo_file
. test_read_nonnormalized_manifest_with_collection_reader
can just check againstnonnormal
in the firstassertEqual
. Because Python's strings are immutable, there's no risk that the tested function could warp the test.- The Python style guide calls for two blank lines after a class definition; this got lost after CollectionBase.
I wrote some additional style comments while I was going through, before I realized that these applied to code that you merely moved around, and didn't actually touch. Brownie points if you're up for addressing them, but no need to hold up the merge if you're not:
- The sorting part of normalize() could be one-lined:
self._streams = [normalize_stream(s, streams[s]) for s in sorted(streams)]
- Empty collections are falsy in Python, so
if list
is preferred overif len(list) > 0
. - String concatenation is relatively expensive in Python. It's preferable to build an array and turn it into a string when you're done with str.join (e.g., in stripped_manifest()).
Thanks.
Updated by Tom Clegg about 10 years ago
Brett Smith wrote:
First, mea culpa on the bad tests for max_manifest_depth. I wrote those when I was adding retry support to arv-put, which was my first big foray into Keep. I think I saw how CollectionWriter put files above the limit in the same block, and figured that was the intended behavior. This is an excellent lesson in the dangers of writing tests after the fact.
Yeah, I had assumed this was a "make sure this continues doing whatever it did before I touched it"...
- Would it make more sense to move the populate try/except work into the individual _populate methods? They could return any exception raised, or else None. Then the main _populate can just say
error_via_keep = self._populate_from_keep()
. It would at least cut down on some redundant code.
That had occurred to me but I didn't get as far as convincing myself it was that much better. But now that you mention it, I've done it and I think it was worth it.
Prefer
except Exception as e
, as elsewhere in this code.
Whoops. Fixed, thank you.
- Having normalize()
return self
is really un-Pythonic. Methods that primarily mutate internal state conventionally return None (a.k.a. don't have a return) to signal exactly this to the caller. If we want our SDK to feel familiar to people familiar with Python, we should follow suit.
What if we added anormalize=False
argument to manifest_text()? That seems to be the primary use case.
I've rearranged this as
foo.normalize()
→ mutate and return None.foo.manifest_text(normalize=True)
→ don't mutate, just return a normalized manifest_text. (Under the hood, create a new CollectionReader and normalize() it.)
- Since
--normalize
and--max-manifest-depth
are incompatible, let's put them into a mutually_exclusive_group in the parser, soarv-put
will generate a nice error message if the user specifies both.
Good call. Done.
- I'm not sure I see where the API server is mocked to return the wrong data in
test_try_keep_first_if_permission_hint
. e.g., you could callmock_get_collection
with a collection other thanfoo_file
.
Indeed, apparently it wasn't. I've added an ALT_COLLECTION
(using the "bar_file" fixture) and now it really does test what it says it tests. (This time I tried breaking the code and making sure the test failed.)
test_read_nonnormalized_manifest_with_collection_reader
can just check againstnonnormal
in the firstassertEqual
. Because Python's strings are immutable, there's no risk that the tested function could warp the test.
If I'm following you correctly: Those strings aren't identical. I'm guessing I confused you by testing two things at once here: (1) the files don't get rearranged, (2) the permission signature gets stripped. I split this into two assertions, and used re.sub() instead of giving another literal to make it (hopefully) more clear what's happening.
- The Python style guide calls for two blank lines after a class definition; this got lost after CollectionBase.
Fixed.
I wrote some additional style comments while I was going through, before I realized that these applied to code that you merely moved around, and didn't actually touch. Brownie points if you're up for addressing them, but no need to hold up the merge if you're not:
Here I go, apparently I want those brownie points...
- The sorting part of normalize() could be one-lined:
self._streams = [normalize_stream(s, streams[s]) for s in sorted(streams)]
Done.
- Empty collections are falsy in Python, so
if list
is preferred overif len(list) > 0
.
Fixed ×3.
- String concatenation is relatively expensive in Python. It's preferable to build an array and turn it into a string when you're done with str.join (e.g., in stripped_manifest()).
55a0e06 better? It does 1+Nstreams joins. It also avoids making an exception for the last token: although it's true that the last token is always a file token, the old code hinted that only the last token was exempt, which is a bit of a red herring.
Now at ce2362b
Updated by Brett Smith about 10 years ago
Reviewing ce2362b
Tom Clegg wrote:
Brett Smith wrote:
- Would it make more sense to move the populate try/except work into the individual _populate methods? They could return any exception raised, or else None. Then the main _populate can just say
error_via_keep = self._populate_from_keep()
. It would at least cut down on some redundant code.That had occurred to me but I didn't get as far as convincing myself it was that much better. But now that you mention it, I've done it and I think it was worth it.
Great, thanks. One small thing: in _populate, please prefer is None
over == None
and is not None
over != None
.
- Having normalize()
return self
is really un-Pythonic. Methods that primarily mutate internal state conventionally return None (a.k.a. don't have a return) to signal exactly this to the caller. If we want our SDK to feel familiar to people familiar with Python, we should follow suit.
What if we added anormalize=False
argument to manifest_text()? That seems to be the primary use case.I've rearranged this…
Thank you, I think this is a much more natural fit. You missed one small spot: a change in the FUSE driver is still expecting normalize() to return the object, making the driver crash and tests fail.
- Since
--normalize
and--max-manifest-depth
are incompatible, let's put them into a mutually_exclusive_group in the parser, soarv-put
will generate a nice error message if the user specifies both.Good call. Done.
This is making me wonder, should --normalize
be the default behavior for arv-put, since it's been doing that all along? Someone who wanted neither normalization nor a max manifest depth could get that by saying --max-manifest-depth $BIGNUM
, or we could provide --no-normalize
. But I'm honestly not sure what the right answer is here, policy-wise.
A couple other small things:
- This branch removes the line
self.mock_get_collection(client, 404, None)
from test_locator_init_fallback_to_keep. However, the default setup method mocks the API client to return a 200 for a requested collection. If I'm following correctly, without this line, CollectionReader could request the collection from the API server and pass, which I think thwarts the test's intent. - The condition at the end of CollectionWriter.manifest_text() is now a noop and can be simplified to
return manifest
.
Thanks.
Updated by Tom Clegg about 10 years ago
Brett Smith wrote:
Great, thanks. One small thing: in _populate, please prefer
is None
over== None
andis not None
over!= None
.
Fixed. Also fixed a few other existing uses of != and ==, including one where the result of the comparison was thrown away anyway:
while now <= timeout: - os.kill(server_pid, signal.SIGTERM) == None + os.kill(server_pid, signal.SIGTERM)
Thank you, I think this is a much more natural fit. You missed one small spot: a change in the FUSE driver is still expecting normalize() to return the object, making the driver crash and tests fail.
Whoops. Fixed now. Maybe I'll run all the tests this time. :)
This is making me wonder, should
--normalize
be the default behavior for arv-put, since it's been doing that all along? Someone who wanted neither normalization nor a max manifest depth could get that by saying--max-manifest-depth $BIGNUM
, or we could provide--no-normalize
. But I'm honestly not sure what the right answer is here, policy-wise.
I've updated work_directory_tree so it always sorts entries. Unless I'm forgetting something, this is nearly1 equivalent to making arv-put default to --normalize
, except that the blocks now get written in the same order as the files. Previously, when relying on normalize(), although the files appeared in sorted order, the data blocks were written based on the arbitrary order in which the directory entries were returned by os.listdir()
.
This change means readers who read all_files() in the order given in the resulting manifest will read blocks sequentially, which in general will perform better.
1 I suppose the remaining difference is that "arv-put --normalize" allows you to specify multiple overlapping paths and get a manifest that (more or less) hides the fact that you did that: e.g., "arv-put foo/bar.txt foo/baz.txt" would have one stream instead of two, but it would have two data blocks even if the files are tiny.
- This branch removes the line
self.mock_get_collection(client, 404, None)
from test_locator_init_fallback_to_keep. However, the default setup method mocks the API client to return a 200 for a requested collection. If I'm following correctly, without this line, CollectionReader could request the collection from the API server and pass, which I think thwarts the test's intent.
Huh. I think I was expecting tutil.mock_responses(self.DEFAULT_MANIFEST, 404, 200)
to return 404 for collections.get and then 200 for the "get" from the Keep server. In any case, I've changed it back to do mock_get_collection(...404...)
followed by mock_responses(...200)
since there seems to have been no reason to change this in the first place.
- The condition at the end of CollectionWriter.manifest_text() is now a noop and can be simplified to
return manifest
.
Indeed, thanks. Fixed.
All tests passed before that very last thing. Re-running now just in case. But I think it will be fine. :)
Now at 411305d
Updated by Brett Smith about 10 years ago
Tom Clegg wrote:
Brett Smith wrote:
- This branch removes the line
self.mock_get_collection(client, 404, None)
from test_locator_init_fallback_to_keep. However, the default setup method mocks the API client to return a 200 for a requested collection. If I'm following correctly, without this line, CollectionReader could request the collection from the API server and pass, which I think thwarts the test's intent.Huh. I think I was expecting
tutil.mock_responses(self.DEFAULT_MANIFEST, 404, 200)
to return 404 for collections.get and then 200 for the "get" from the Keep server.
It might have that effect if the API client wasn't itself a mock. Maybe we use that strategy in another test suite?
411305d is good to merge, thanks.
Updated by Brett Smith about 10 years ago
I recanted on IRC. For posterity: I think _work_trees needs to say: max_depth = (None if max_manifest_depth < 0 else max_manifest_depth)
. You may also need to put some guards on CollectionWriter's internal handling of max_manifest_depth, to make sure it doesn't go from 0 to -1. Because listdir_recursive never returned directory entries before, that was never an issue.
Updated by Anonymous about 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 80 to 100
Applied in changeset arvados|commit:68dae835a2f8bebd5664211a60fc76d50e5ecc23.