Bug #4363
closed[Keep] arv-mount should not munge filenames unless required by POSIX
Description
One particularly unfortunate outcome of the current "strip characters we think you shouldn't use" code is:
echo foo | arv-put - # filename is "-" ls -l /keep/by_id/{id}/ # --> I/O ErrorOther problems include
- A file called "Foo" alongside a file called "-Foo" → undefined behavior
- Files with name A in one place and name B in other places → confusing
Updated by Tom Clegg about 10 years ago
- Description updated (diff)
- Category set to Keep
Updated by Tom Clegg about 10 years ago
At e8dd6d9
Notes- NUL bytes in filenames are now prohibited by CollectionWriter (this seemed more reasonable than chasing the bug that makes arv-mount choke on NUL before getting to
sanitize_filename()
). - Scope expanded a little to support (at least some) utf-8-encoded characters.
- Sailboats ⛵ and victory signs ✌ work.
- The "scream" emoticon 😱 doesn't work.
U+1f631 "\xf0\x9f\x98\xb1"
in a collection shows up inos.listdir
asU+f631 "\xef\x98\xb1"
. At this point I decided not to get hung up on it. - Decided manifest_text normally stays utf-8-encoded because it's a storage format, while Python methods like name() deal in unicode strings. However,
CollectionReader
checks whether the manifest_text is unicode, to avoid double-decoding.
Updated by Brett Smith about 10 years ago
Tom Clegg wrote:
At e8dd6d9
Okay, so, I'm glad to have some clarity around the encoding of manifests, but you realize it's going to take a lot more to get the Python SDK fully usable with the limitation to UTF-8, right? Just by reading the code, I can see that CollectionWriter.manifest_text, StreamFileReader.as_manifest, and both collection open methods (at least their exception messages) need to start dealing with Unicode strings to work consistently with the other changes in this branch—basically anywhere we use str.format
in the context of streams is suspect. Ideally, we'd have tests for all these cases, too. I realize that's a much bigger project than was originally planned for this branch, but, well, declaring "Manifests must be UTF-8" is inevitably going to have far-reaching consequences. And this isn't even worrying about whether other tools like crunch-job, run-command, etc. are prepared to cope.
If you want to go ahead:
- Please document that manifests must be UTF-8.
- Should we add a validation to the API server to make sure manifests actually are UTF-8?
- Prefer
isinstance(self._manifest_text, unicode)
overtype(self._manifest_text) == unicode
, to properly accept subclasses.
Updated by Tom Clegg about 10 years ago
Brett Smith wrote:
Okay, so, I'm glad to have some clarity around the encoding of manifests, but you realize it's going to take a lot more to get the Python SDK fully usable with the limitation to UTF-8, right?
My goal here was really just to prevent arv-mount from failing completely ("collection is empty!") when encountering a UTF-8 character in a manifest. Having everything work with UTF-8 characters sounds even better. Added #4551 for that.
- Please document that manifests must be UTF-8.
Added a few words at Keep manifest format
- Should we add a validation to the API server to make sure manifests actually are UTF-8?
Done. Added unit tests.
- Prefer
isinstance(self._manifest_text, unicode)
overtype(self._manifest_text) == unicode
, to properly accept subclasses.
Oops. Fixed.
Now at 9cd3ddf
Updated by Brett Smith about 10 years ago
Tom Clegg wrote:
Brett Smith wrote:
Okay, so, I'm glad to have some clarity around the encoding of manifests, but you realize it's going to take a lot more to get the Python SDK fully usable with the limitation to UTF-8, right?
My goal here was really just to prevent arv-mount from failing completely ("collection is empty!") when encountering a UTF-8 character in a manifest. Having everything work with UTF-8 characters sounds even better. Added #4551 for that.
I want to make sure we're on the same page about what this means: code that previously used UTF-8 and ran fine (by passing raw bytes to all the Arvados objects) will now crash because CollectionReader internally decodes the UTF-8 into Unicode strings, and the rest of the SDK doesn't cope with those. For example, given this collection:
>>> filename = u'☺'.encode('utf-8') >>> cw = arvados.CollectionWriter() >>> with cw.open(filename) as f: ... f.write('test text') ... >>> cw.finish() 'f6426d298c7c0d83ff23ee539d8f5d06+45'
Here's a common idiom for file transformation in our Crunch scripts that now breaks:
>>> cr = arvados.CollectionReader('f6426d298c7c0d83ff23ee539d8f5d06+45') >>> input_file = next(cr.all_files()) >>> out = arvados.CollectionWriter() >>> out.set_current_file_name(input_file.name()) >>> out.set_current_stream_name(input_file.stream_name()) >>> for text in input_file.readall(): ... out.write(text.upper()) >>> out.finish() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "arvados/collection.py", line 544, in finish return self._my_keep().put(self.manifest_text()) File "arvados/keep.py", line 725, in local_store_put md5 = hashlib.md5(data).hexdigest() UnicodeEncodeError: 'ascii' codec can't encode character u'\u263a' in position 41: ordinal not in range(128)
This works fine under master right now. You're OK with breaking it? Bear in mind that any Crunch job using run-command that output a UTF-8 filename is in this class of users, and may not realize it.
Fixing only the parts of the SDK that arv-mount happens to use seems really rough for SDK usability, too. How are users supposed to know, or even guess, that they should pass Unicode strings to CollectionReader.open, but raw bytes to CollectionWriter.open? I'm worried that if we appear to treat the SDK as our internal tool library, only fixing bugs as much as necessary to fix the tools, others will hesitate to write code on top of it.
I know that "Manifests are consistently handled Unicode-aware across the Python SDK" is not the job you signed up for, but personally, I think that fixing CollectionReader alone is likely to create a lot of new bugs, and I'm not sold that fixing the bug for arv-mount is important or urgent enough to justify that trouble.
- Should we add a validation to the API server to make sure manifests actually are UTF-8?
Done. Added unit tests.
Do you know what sets the string encoding in Rails? In bare Ruby, it seems to be the locale:
brinstar % ruby -e 'p "".encoding' #<Encoding:UTF-8> brinstar % LC_ALL=C ruby -e 'p "".encoding' #<Encoding:US-ASCII>
I haven't successfully used this information for evil in Rails (read: broken your tests), but I wonder if the tests are subtly dependent on the system setup, similar to our database collation issue earlier.
Since we're introspecting the declared encoding, should the validation detect and accept plain ASCII?
Updated by Tom Clegg about 10 years ago
Brett Smith wrote:
I want to make sure we're on the same page about what this means: code that previously used UTF-8 and ran fine (by passing raw bytes to all the Arvados objects) will now crash because CollectionReader internally decodes the UTF-8 into Unicode strings, and the rest of the SDK doesn't cope with those. For example, given this collection:
Well, I don't think it matters too much exactly where in the chain UTF-8 filenames break, if they break. That said, it probably does make more sense to tackle the encoding issue properly instead of just solving one specific scenario.
This works fine under master right now. You're OK with breaking it? Bear in mind that any Crunch job using run-command that output a UTF-8 filename is in this class of users, and may not realize it.
The fact that collection validations currently fail on multibyte characters (because length != bytesize) makes me suspect this class is empty... but still, point taken.
Added #4551. Split this branch intoI know that "Manifests are consistently handled Unicode-aware across the Python SDK" is not the job you signed up for, but personally, I think that fixing CollectionReader alone is likely to create a lot of new bugs, and I'm not sold that fixing the bug for arv-mount is important or urgent enough to justify that trouble.
- 4363-less-filename-munging, without the UTF-8 changes on the Python side, but with the API side validations.
- 4551-utf-8
I haven't successfully used this information for evil in Rails (read: broken your tests), but I wonder if the tests are subtly dependent on the system setup, similar to our database collation issue earlier.
Since we're introspecting the declared encoding, should the validation detect and accept plain ASCII?
Indeed. I added some explicit encodings in the tests, and added a "transcode to UTF-8 and see if it's identical" step to the validation, which covers a more general case.
Now at 9f52191. Tests pass with LC_ALL=en_CA.UTF-8
and with LC_ALL=C
.
(Waiting for full run-tests now)
Updated by Brett Smith about 10 years ago
Updated by Anonymous about 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 88 to 100
Applied in changeset arvados|commit:3735d52b65928e3626a8e223acadc318a3d31097.