Bug #21718
closedregression calling decode() on memoryview returned by ArvFile
Added by Peter Amstutz 9 months ago. Updated 8 months ago.
Description
Unfortunately there seems to be a regression in the 2.7.2 release that wasn't caught by our tests.
What seems to have happened is that the data type returned by the Python SDK when reading from Keep changed to avoid copying a buffer (as an optimization), but the methods available on the new type are slightly different. Specifically throwing an exception on the call to "decode()":
if "cwl.output.json" in outc: with outc.open("cwl.output.json", "rb") as f: if f.size() > 0: outputs = json.loads(f.read().decode())
Updated by Peter Amstutz 9 months ago
- Description updated (diff)
- Subject changed from regression calling encode() on memoryview returned by ArvFile to regression calling decode() on memoryview returned by ArvFile
Updated by Claudio Lorenzi 9 months ago
Hi!
If it can help, here https://github.com/K0lb3/UnityPy/pull/24/files#diff-6251546231e080376d5bd106e7470c7bde8a672c4af32703fd362b239857eebe they solved by adding a check before the decoding. In this case it would be something like:
if "cwl.output.json" in outc:
with outc.open("cwl.output.json", "rb") as f:
if f.size() > 0:
out_bytes = f.read()
if isinstance(name, memoryview):
out_bytes = name.tobytes()
outputs = json.loads(out_bytes.decode())
Updated by Peter Amstutz 9 months ago
- Target version changed from Development 2024-05-08 sprint to Development 2024-05-22 sprint
Updated by Peter Amstutz 8 months ago
21718-optional-memoryview @ 9876a78e50d1b3ad2571b6e77d92bcb2465d996e
Updated by Peter Amstutz 8 months ago
21718-optional-memoryview @ 6dd10f26c5f87f0571a721d3499e831bc421f5e0
Updated by Peter Amstutz 8 months ago
21718-optional-memoryview @ e566c811836e87c549306303953dcc8da126b301
- All agreed upon points are implemented / addressed.
- yes, this solves the regression
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- I intend to post an additional branch on this story which solves the problem on the other side by using
str(data, 'utf-8')
instead of.decode('utf-8')
- I intend to post an additional branch on this story which solves the problem on the other side by using
- Code is tested and passing, both automated and manual, what manual testing was done is described
- passes unit tests and CWL test suite
- Documentation has been updated.
- restores default behavior, docstring API documentation describes the new flag.
- Behaves appropriately at the intended scale (describe intended scale).
- returning a bytes object is worse for performance because it involves a data copy, I intend to post an additional branch that flips the default for 3.0
- Considered backwards and forwards compatibility issues between client and server.
- this reverts the default behavior to pre-2.7.2
- Follows our coding standards and GUI style guidelines.
- yes
This issue here is that changes to the Python SDK had it returning a memoryview
object (the Python version of a zero-copy slice) instead of a bytes
object, and that code would convert it into a Python string using the .decode()
method which is present on bytes
but not on memoryview
.
My original plan was to replace the usage of .decode('utf-8')
with str(data, 'utf-8')
, but it turns out that requires making changes to cwltool and it's non-trivial to upgrade cwltool in Arvados 2.7 because cwltool dropped support for versions of Python that we're still supporting in Arvados 2.7.
My new plan is putting the behavior on a flag so that FUSE can select the new behavior and everyone else get the old behavior by default, which can go into a 2.7.3 release.
However for 3.0, I'd like to switch it to the new behavior by default, so I'll have a follow-up branch that switches the default and removes the use of decode()
.
Updated by Peter Amstutz 8 months ago
21718-memoryview-decode @ 8b273a91ca36b167d1d89ce5798584328df44b7e
This is the follow-up branch which fixes the issue on the other side, by replacing .decode('utf-8')
with str(data, 'utf-8')
.
Updated by Tom Clegg 8 months ago
21718-optional-memoryview @ e566c811836e87c549306303953dcc8da126b301
This looks like it will fix the regression, yay.
Why not add the new flag to the readfrom() call itself, such that it only affects the return value of that particular call to that function -- similar to the exact
flag? It seems like that would be easier to explain/follow, and avoid the little chain of function calls on each read to get the Collection-wide setting.
Also, I'm not sure "return bytes or memoryview, depending on Things" is a good default behavior even in 3.0. I'd be more inclined to leave it optional -- power users like arv-mount can get the optimization benefit but the default is easier to use correctly because it always returns the same type.
Updated by Peter Amstutz 8 months ago
21718-memoryview-readfrom @ d022c4d9dbd94ada1f9d0449cd7c9494b90bb518
New branch taking suggested approach of making it a flag on readfrom()
, I agree this is a less invasive fix.
This splits the difference by making it a flag on Collection.open() so it becomes a behavior of the specific file-like object, or in the case of FUSE, passed explicitly to the readfrom() call since FUSE calls ArvFile directly.
Updated by Peter Amstutz 8 months ago
The 21718-optional-memoryview
branch is abandoned.
21718-memoryview-readfrom
is the new one described in note-14. This branch also passes CWL tests. The intention is to merge and cherry-pick this one into 2.7.3.
21718-memoryview-decode
removes the usage of decode(), but because this requires upgrading cwltool, it isn't suitable for 2.7.3, but I think it is worth merging for 3.0.
Updated by Tom Clegg 8 months ago
82b47fbaeb66a790b10f830ede3119462e2e402e looks good -- adds the return_memoryview
option to read() and readfrom().
return_memoryview
flag from read() and adds it to open(), so now we have
readfrom(..., return_memoryview=True)
meaning "memoryview enabled on this call"read(..., return_memoryview=True)
error, not supportedopen(..., return_memoryview=True)
meaning "memoryview enabled on future calls to read()" (the pattern I complained about in #note-13)
What was the problem with adding the flag to read()? What is the anticipated use case for the "file-like object that doesn't quite match the io.RawIOBase interface because read() sometimes returns memoryview" feature?
If we really do need this, the docstring should at least explain that open(..., return_memoryview=True)
changes the behavior of read() on the returned object, but does not change the behavior of readfrom().
Tangentially, I noticed io.RawIOBase specifies read(size=-1)
means read until EOF, but our implementation doesn't seem to obey that. We might want to fix that at some point.
Updated by Peter Amstutz 8 months ago
21718-memoryview-readfrom @ 84366e3b482e1f649d7a16943bf825212db0a6b2
Made one additional change so that return_memoryview
always returns a memory view object, regardless of whether the underlying object is bytes
or mmap
.
Updated by Peter Amstutz 8 months ago
Tom Clegg wrote in #note-16:
82b47fbaeb66a790b10f830ede3119462e2e402e looks good -- adds the
But then d022c4d9dbd94ada1f9d0449cd7c9494b90bb518 removes thereturn_memoryview
option to read() and readfrom().return_memoryview
flag from read() and adds it to open(), so now we have
readfrom(..., return_memoryview=True)
meaning "memoryview enabled on this call"read(..., return_memoryview=True)
error, not supportedopen(..., return_memoryview=True)
meaning "memoryview enabled on future calls to read()" (the pattern I complained about in #note-13)What was the problem with adding the flag to read()? What is the anticipated use case for the "file-like object that doesn't quite match the io.RawIOBase interface because read() sometimes returns memoryview" feature?
If we really do need this, the docstring should at least explain that
open(..., return_memoryview=True)
changes the behavior of read() on the returned object, but does not change the behavior of readfrom().
There are two instances of readfrom
, ArvadosFileReader.readfrom()
and ArvadosFile.readfrom()
.
The Collection.open()
call returns ArvadosFileReader/ArvadosFileWriter
, those are file-like objects. Those should support normal file APIs.
They wrap ArvadosFile
which is not a file-like object.
The reason to make return_memoryview
a flag of ArvadosFileReader/ArvadosFileWriter
passed on Collection.open()
is so that if you open a file, then pass it to a 3rd party method, and the 3rd party method calls read()
or readfrom()
on the Collection file and the 3rd party method is able to accept memory view objects, you don't have to change the code of that 3rd party method to add read(return_memoryview=True)
.
Tangentially, I noticed io.RawIOBase specifies
read(size=-1)
means read until EOF, but our implementation doesn't seem to obey that. We might want to fix that at some point.
Yea, there's probably a few methods that we're missing, but it hasn't come up.
Updated by Tom Clegg 8 months ago
Peter Amstutz wrote in #note-18:
The reason to make
return_memoryview
a flag ofArvadosFileReader/ArvadosFileWriter
passed onCollection.open()
is so that if you open a file, then pass it to a 3rd party method, and the 3rd party method callsread()
orreadfrom()
on the Collection file and the 3rd party method is able to accept memory view objects, you don't have to change the code of that 3rd party method to addread(return_memoryview=True)
.
A future version of the 3rd party code will choke on memoryview, so this is a bad idea. If someone decides to do it anyway, they can wrap/monkeypatch the ArvadosFileReader object themselves and be hoist with their own petard. It doesn't need to be an Arvados feature.
Updated by Peter Amstutz 8 months ago
21718-memoryview-readfrom-v2 @ a41610d3f36852e9abf7eaa03e91d170c1c441b4
Updated by Tom Clegg 8 months ago
- stray trailing
)
in thereturn_memoryview
para in the docstring forArvadosFile.readfrom()
(and might as well upgrade the en-dash to an em-dash"---"
to match "exact") - maybe the degenerate case
return b''
at the top of read from() should also return amemoryview(b'')
if the flag is on? - maybe update the docstring for
ArvadosFileReader.read()
to mention that size=-1 means read all (maybe change the default to size=-1 too, just to make it look more standard in the docs?)
Thanks!
Updated by Peter Amstutz 8 months ago
21718-memoryview-readfrom-v2 @ 04aca3328c7fbced6e50d7c8a35e0004baf03568
Updated by Peter Amstutz 8 months ago
- Target version changed from Development 2024-05-22 sprint to Development 2024-06-05 sprint
Updated by Peter Amstutz 8 months ago
- Status changed from In Progress to Resolved