Project

General

Profile

Actions

Bug #21718

closed

regression calling decode() on memoryview returned by ArvFile

Added by Peter Amstutz 9 months ago. Updated 8 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
CWL
Story points:
-
Release relationship:
Auto

Description

Unfortunately there seems to be a regression in the 2.7.2 release that wasn't caught by our tests.

What seems to have happened is that the data type returned by the Python SDK when reading from Keep changed to avoid copying a buffer (as an optimization), but the methods available on the new type are slightly different. Specifically throwing an exception on the call to "decode()":

            if "cwl.output.json" in outc:
                with outc.open("cwl.output.json", "rb") as f:
                    if f.size() > 0:
                        outputs = json.loads(f.read().decode())

Subtasks 4 (0 open4 closed)

Task #21727: Review 21718-optional-memoryview ClosedPeter Amstutz05/20/2024Actions
Task #21778: Review 21718-memoryview-readfromClosedTom Clegg05/21/2024Actions
Task #21779: Review 21718-memoryview-decodeResolvedTom Clegg05/22/2024Actions
Task #21812: Review 21718-memoryview-readfrom-v2 ResolvedPeter Amstutz05/22/2024Actions
Actions #1

Updated by Peter Amstutz 9 months ago

  • Description updated (diff)
  • Subject changed from regression calling encode() on memoryview returned by ArvFile to regression calling decode() on memoryview returned by ArvFile
Actions #2

Updated by Peter Amstutz 9 months ago

  • Assigned To set to Peter Amstutz
Actions #5

Updated by Claudio Lorenzi 9 months ago

Hi!
If it can help, here https://github.com/K0lb3/UnityPy/pull/24/files#diff-6251546231e080376d5bd106e7470c7bde8a672c4af32703fd362b239857eebe they solved by adding a check before the decoding. In this case it would be something like:

if "cwl.output.json" in outc:
   with outc.open("cwl.output.json", "rb") as f:
     if f.size() > 0:
       out_bytes = f.read()
       if isinstance(name, memoryview):
           out_bytes = name.tobytes()
       outputs = json.loads(out_bytes.decode())

Actions #6

Updated by Peter Amstutz 9 months ago

  • Status changed from New to In Progress
Actions #7

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2024-05-08 sprint to Development 2024-05-22 sprint
Actions #8

Updated by Peter Amstutz 8 months ago

  • Release set to 71
Actions #11

Updated by Peter Amstutz 8 months ago

21718-optional-memoryview @ e566c811836e87c549306303953dcc8da126b301

developer-run-tests: #4233

  • All agreed upon points are implemented / addressed.
    • yes, this solves the regression
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • I intend to post an additional branch on this story which solves the problem on the other side by using str(data, 'utf-8') instead of .decode('utf-8')
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • passes unit tests and CWL test suite
  • Documentation has been updated.
    • restores default behavior, docstring API documentation describes the new flag.
  • Behaves appropriately at the intended scale (describe intended scale).
    • returning a bytes object is worse for performance because it involves a data copy, I intend to post an additional branch that flips the default for 3.0
  • Considered backwards and forwards compatibility issues between client and server.
    • this reverts the default behavior to pre-2.7.2
  • Follows our coding standards and GUI style guidelines.
    • yes

This issue here is that changes to the Python SDK had it returning a memoryview object (the Python version of a zero-copy slice) instead of a bytes object, and that code would convert it into a Python string using the .decode() method which is present on bytes but not on memoryview.

My original plan was to replace the usage of .decode('utf-8') with str(data, 'utf-8'), but it turns out that requires making changes to cwltool and it's non-trivial to upgrade cwltool in Arvados 2.7 because cwltool dropped support for versions of Python that we're still supporting in Arvados 2.7.

My new plan is putting the behavior on a flag so that FUSE can select the new behavior and everyone else get the old behavior by default, which can go into a 2.7.3 release.

However for 3.0, I'd like to switch it to the new behavior by default, so I'll have a follow-up branch that switches the default and removes the use of decode().

Actions #12

Updated by Peter Amstutz 8 months ago

21718-memoryview-decode @ 8b273a91ca36b167d1d89ce5798584328df44b7e

developer-run-tests: #4232

This is the follow-up branch which fixes the issue on the other side, by replacing .decode('utf-8') with str(data, 'utf-8').

Actions #13

Updated by Tom Clegg 8 months ago

21718-optional-memoryview @ e566c811836e87c549306303953dcc8da126b301

This looks like it will fix the regression, yay.

Why not add the new flag to the readfrom() call itself, such that it only affects the return value of that particular call to that function -- similar to the exact flag? It seems like that would be easier to explain/follow, and avoid the little chain of function calls on each read to get the Collection-wide setting.

Also, I'm not sure "return bytes or memoryview, depending on Things" is a good default behavior even in 3.0. I'd be more inclined to leave it optional -- power users like arv-mount can get the optimization benefit but the default is easier to use correctly because it always returns the same type.

Actions #14

Updated by Peter Amstutz 8 months ago

21718-memoryview-readfrom @ d022c4d9dbd94ada1f9d0449cd7c9494b90bb518

developer-run-tests: #4236

New branch taking suggested approach of making it a flag on readfrom(), I agree this is a less invasive fix.

This splits the difference by making it a flag on Collection.open() so it becomes a behavior of the specific file-like object, or in the case of FUSE, passed explicitly to the readfrom() call since FUSE calls ArvFile directly.

Actions #15

Updated by Peter Amstutz 8 months ago

The 21718-optional-memoryview branch is abandoned.

21718-memoryview-readfrom is the new one described in note-14. This branch also passes CWL tests. The intention is to merge and cherry-pick this one into 2.7.3.

21718-memoryview-decode removes the usage of decode(), but because this requires upgrading cwltool, it isn't suitable for 2.7.3, but I think it is worth merging for 3.0.

Actions #16

Updated by Tom Clegg 8 months ago

82b47fbaeb66a790b10f830ede3119462e2e402e looks good -- adds the return_memoryview option to read() and readfrom().

But then d022c4d9dbd94ada1f9d0449cd7c9494b90bb518 removes the return_memoryview flag from read() and adds it to open(), so now we have
  • readfrom(..., return_memoryview=True) meaning "memoryview enabled on this call"
  • read(..., return_memoryview=True) error, not supported
  • open(..., return_memoryview=True) meaning "memoryview enabled on future calls to read()" (the pattern I complained about in #note-13)

What was the problem with adding the flag to read()? What is the anticipated use case for the "file-like object that doesn't quite match the io.RawIOBase interface because read() sometimes returns memoryview" feature?

If we really do need this, the docstring should at least explain that open(..., return_memoryview=True) changes the behavior of read() on the returned object, but does not change the behavior of readfrom().

Tangentially, I noticed io.RawIOBase specifies read(size=-1) means read until EOF, but our implementation doesn't seem to obey that. We might want to fix that at some point.

Actions #17

Updated by Peter Amstutz 8 months ago

21718-memoryview-readfrom @ 84366e3b482e1f649d7a16943bf825212db0a6b2

developer-run-tests: #4237

Made one additional change so that return_memoryview always returns a memory view object, regardless of whether the underlying object is bytes or mmap.

Actions #18

Updated by Peter Amstutz 8 months ago

Tom Clegg wrote in #note-16:

82b47fbaeb66a790b10f830ede3119462e2e402e looks good -- adds the return_memoryview option to read() and readfrom().

But then d022c4d9dbd94ada1f9d0449cd7c9494b90bb518 removes the return_memoryview flag from read() and adds it to open(), so now we have
  • readfrom(..., return_memoryview=True) meaning "memoryview enabled on this call"
  • read(..., return_memoryview=True) error, not supported
  • open(..., return_memoryview=True) meaning "memoryview enabled on future calls to read()" (the pattern I complained about in #note-13)

What was the problem with adding the flag to read()? What is the anticipated use case for the "file-like object that doesn't quite match the io.RawIOBase interface because read() sometimes returns memoryview" feature?

If we really do need this, the docstring should at least explain that open(..., return_memoryview=True) changes the behavior of read() on the returned object, but does not change the behavior of readfrom().

There are two instances of readfrom, ArvadosFileReader.readfrom() and ArvadosFile.readfrom().

The Collection.open() call returns ArvadosFileReader/ArvadosFileWriter, those are file-like objects. Those should support normal file APIs.

They wrap ArvadosFile which is not a file-like object.

The reason to make return_memoryview a flag of ArvadosFileReader/ArvadosFileWriter passed on Collection.open() is so that if you open a file, then pass it to a 3rd party method, and the 3rd party method calls read() or readfrom() on the Collection file and the 3rd party method is able to accept memory view objects, you don't have to change the code of that 3rd party method to add read(return_memoryview=True).

Tangentially, I noticed io.RawIOBase specifies read(size=-1) means read until EOF, but our implementation doesn't seem to obey that. We might want to fix that at some point.

Yea, there's probably a few methods that we're missing, but it hasn't come up.

Actions #19

Updated by Tom Clegg 8 months ago

Peter Amstutz wrote in #note-18:

The reason to make return_memoryview a flag of ArvadosFileReader/ArvadosFileWriter passed on Collection.open() is so that if you open a file, then pass it to a 3rd party method, and the 3rd party method calls read() or readfrom() on the Collection file and the 3rd party method is able to accept memory view objects, you don't have to change the code of that 3rd party method to add read(return_memoryview=True).

A future version of the 3rd party code will choke on memoryview, so this is a bad idea. If someone decides to do it anyway, they can wrap/monkeypatch the ArvadosFileReader object themselves and be hoist with their own petard. It doesn't need to be an Arvados feature.

Actions #21

Updated by Tom Clegg 8 months ago

21718-memoryview-readfrom-v2 LGTM, just a few nitpicks
  • stray trailing ) in the return_memoryview para in the docstring for ArvadosFile.readfrom() (and might as well upgrade the en-dash to an em-dash "---" to match "exact")
  • maybe the degenerate case return b'' at the top of read from() should also return a memoryview(b'') if the flag is on?
  • maybe update the docstring for ArvadosFileReader.read() to mention that size=-1 means read all (maybe change the default to size=-1 too, just to make it look more standard in the docs?)

Thanks!

Actions #23

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2024-05-22 sprint to Development 2024-06-05 sprint
Actions #24

Updated by Tom Clegg 8 months ago

LGTM, thanks!

Actions #25

Updated by Peter Amstutz 8 months ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF