Project

General

Profile

Actions

Bug #4363

closed

[Keep] arv-mount should not munge filenames unless required by POSIX

Added by Tom Clegg over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
Story points:
0.5

Description

One particularly unfortunate outcome of the current "strip characters we think you shouldn't use" code is:

echo foo | arv-put -
# filename is "-" 

ls -l /keep/by_id/{id}/
# --> I/O Error
Other problems include
  • A file called "Foo" alongside a file called "-Foo" → undefined behavior
  • Files with name A in one place and name B in other places → confusing

Subtasks 5 (0 open5 closed)

Task #4365: Ensure posix-incompatible filenames do not cause I/O errors (add tests)ResolvedTom Clegg11/12/2014Actions
Task #4375: Add testsResolvedTom Clegg11/12/2014Actions
Task #4364: Ensure posix-allowed filenames are not altered by arv-mount (add tests)ResolvedTom Clegg11/12/2014Actions
Task #4510: Test some utf8 charactersResolvedTom Clegg11/12/2014Actions
Task #4517: Review 4363-less-filename-mungingResolvedBrett Smith11/12/2014Actions

Related issues

Related to Arvados - Idea #4551: [Workbench] [API] Support UTF-8 filenames and stream names in all manifest-handling code.ResolvedActions
Actions #1

Updated by Tom Clegg over 9 years ago

  • Description updated (diff)
  • Category set to Keep
Actions #2

Updated by Tom Clegg over 9 years ago

  • Assigned To set to Tom Clegg
Actions #3

Updated by Tom Clegg over 9 years ago

  • Status changed from New to In Progress
Actions #4

Updated by Tom Clegg over 9 years ago

At e8dd6d9

Notes
  • NUL bytes in filenames are now prohibited by CollectionWriter (this seemed more reasonable than chasing the bug that makes arv-mount choke on NUL before getting to sanitize_filename()).
  • Scope expanded a little to support (at least some) utf-8-encoded characters.
    • Sailboats ⛵ and victory signs ✌ work.
    • The "scream" emoticon 😱 doesn't work. U+1f631 "\xf0\x9f\x98\xb1" in a collection shows up in os.listdir as U+f631 "\xef\x98\xb1". At this point I decided not to get hung up on it.
    • Decided manifest_text normally stays utf-8-encoded because it's a storage format, while Python methods like name() deal in unicode strings. However, CollectionReader checks whether the manifest_text is unicode, to avoid double-decoding.
Actions #5

Updated by Brett Smith over 9 years ago

Tom Clegg wrote:

At e8dd6d9

Okay, so, I'm glad to have some clarity around the encoding of manifests, but you realize it's going to take a lot more to get the Python SDK fully usable with the limitation to UTF-8, right? Just by reading the code, I can see that CollectionWriter.manifest_text, StreamFileReader.as_manifest, and both collection open methods (at least their exception messages) need to start dealing with Unicode strings to work consistently with the other changes in this branch—basically anywhere we use str.format in the context of streams is suspect. Ideally, we'd have tests for all these cases, too. I realize that's a much bigger project than was originally planned for this branch, but, well, declaring "Manifests must be UTF-8" is inevitably going to have far-reaching consequences. And this isn't even worrying about whether other tools like crunch-job, run-command, etc. are prepared to cope.

If you want to go ahead:

  • Please document that manifests must be UTF-8.
  • Should we add a validation to the API server to make sure manifests actually are UTF-8?
  • Prefer isinstance(self._manifest_text, unicode) over type(self._manifest_text) == unicode, to properly accept subclasses.
Actions #6

Updated by Tom Clegg over 9 years ago

Brett Smith wrote:

Okay, so, I'm glad to have some clarity around the encoding of manifests, but you realize it's going to take a lot more to get the Python SDK fully usable with the limitation to UTF-8, right?

My goal here was really just to prevent arv-mount from failing completely ("collection is empty!") when encountering a UTF-8 character in a manifest. Having everything work with UTF-8 characters sounds even better. Added #4551 for that.

  • Please document that manifests must be UTF-8.

Added a few words at Keep manifest format

  • Should we add a validation to the API server to make sure manifests actually are UTF-8?

Done. Added unit tests.

  • Prefer isinstance(self._manifest_text, unicode) over type(self._manifest_text) == unicode, to properly accept subclasses.

Oops. Fixed.

Now at 9cd3ddf

Actions #7

Updated by Brett Smith over 9 years ago

Tom Clegg wrote:

Brett Smith wrote:

Okay, so, I'm glad to have some clarity around the encoding of manifests, but you realize it's going to take a lot more to get the Python SDK fully usable with the limitation to UTF-8, right?

My goal here was really just to prevent arv-mount from failing completely ("collection is empty!") when encountering a UTF-8 character in a manifest. Having everything work with UTF-8 characters sounds even better. Added #4551 for that.

I want to make sure we're on the same page about what this means: code that previously used UTF-8 and ran fine (by passing raw bytes to all the Arvados objects) will now crash because CollectionReader internally decodes the UTF-8 into Unicode strings, and the rest of the SDK doesn't cope with those. For example, given this collection:

>>> filename = u'☺'.encode('utf-8')
>>> cw = arvados.CollectionWriter()
>>> with cw.open(filename) as f:
...     f.write('test text')
...
>>> cw.finish()
'f6426d298c7c0d83ff23ee539d8f5d06+45'

Here's a common idiom for file transformation in our Crunch scripts that now breaks:

>>> cr = arvados.CollectionReader('f6426d298c7c0d83ff23ee539d8f5d06+45')
>>> input_file = next(cr.all_files())
>>> out = arvados.CollectionWriter()
>>> out.set_current_file_name(input_file.name())
>>> out.set_current_stream_name(input_file.stream_name())
>>> for text in input_file.readall():
...     out.write(text.upper())
>>> out.finish()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "arvados/collection.py", line 544, in finish
    return self._my_keep().put(self.manifest_text())
  File "arvados/keep.py", line 725, in local_store_put
    md5 = hashlib.md5(data).hexdigest()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u263a' in position 41: ordinal not in range(128)

This works fine under master right now. You're OK with breaking it? Bear in mind that any Crunch job using run-command that output a UTF-8 filename is in this class of users, and may not realize it.

Fixing only the parts of the SDK that arv-mount happens to use seems really rough for SDK usability, too. How are users supposed to know, or even guess, that they should pass Unicode strings to CollectionReader.open, but raw bytes to CollectionWriter.open? I'm worried that if we appear to treat the SDK as our internal tool library, only fixing bugs as much as necessary to fix the tools, others will hesitate to write code on top of it.

I know that "Manifests are consistently handled Unicode-aware across the Python SDK" is not the job you signed up for, but personally, I think that fixing CollectionReader alone is likely to create a lot of new bugs, and I'm not sold that fixing the bug for arv-mount is important or urgent enough to justify that trouble.

  • Should we add a validation to the API server to make sure manifests actually are UTF-8?

Done. Added unit tests.

Do you know what sets the string encoding in Rails? In bare Ruby, it seems to be the locale:

brinstar % ruby -e 'p "".encoding'
#<Encoding:UTF-8>
brinstar % LC_ALL=C ruby -e 'p "".encoding'
#<Encoding:US-ASCII>

I haven't successfully used this information for evil in Rails (read: broken your tests), but I wonder if the tests are subtly dependent on the system setup, similar to our database collation issue earlier.

Since we're introspecting the declared encoding, should the validation detect and accept plain ASCII?

Actions #8

Updated by Tom Clegg over 9 years ago

Brett Smith wrote:

I want to make sure we're on the same page about what this means: code that previously used UTF-8 and ran fine (by passing raw bytes to all the Arvados objects) will now crash because CollectionReader internally decodes the UTF-8 into Unicode strings, and the rest of the SDK doesn't cope with those. For example, given this collection:

Well, I don't think it matters too much exactly where in the chain UTF-8 filenames break, if they break. That said, it probably does make more sense to tackle the encoding issue properly instead of just solving one specific scenario.

This works fine under master right now. You're OK with breaking it? Bear in mind that any Crunch job using run-command that output a UTF-8 filename is in this class of users, and may not realize it.

The fact that collection validations currently fail on multibyte characters (because length != bytesize) makes me suspect this class is empty... but still, point taken.

I know that "Manifests are consistently handled Unicode-aware across the Python SDK" is not the job you signed up for, but personally, I think that fixing CollectionReader alone is likely to create a lot of new bugs, and I'm not sold that fixing the bug for arv-mount is important or urgent enough to justify that trouble.

Added #4551. Split this branch into
  • 4363-less-filename-munging, without the UTF-8 changes on the Python side, but with the API side validations.
  • 4551-utf-8

I haven't successfully used this information for evil in Rails (read: broken your tests), but I wonder if the tests are subtly dependent on the system setup, similar to our database collation issue earlier.

Since we're introspecting the declared encoding, should the validation detect and accept plain ASCII?

Indeed. I added some explicit encodings in the tests, and added a "transcode to UTF-8 and see if it's identical" step to the validation, which covers a more general case.

Now at 9f52191. Tests pass with LC_ALL=en_CA.UTF-8 and with LC_ALL=C.

(Waiting for full run-tests now)

Actions #9

Updated by Brett Smith over 9 years ago

Tom Clegg wrote:

Now at 9f52191.

This commit is good to merge. Thanks.

Actions #10

Updated by Anonymous over 9 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 88 to 100

Applied in changeset arvados|commit:3735d52b65928e3626a8e223acadc318a3d31097.

Actions

Also available in: Atom PDF