Project

General

Profile

Actions

Bug #14539

closed

[SDKs] [arv-mount] Use "." placeholder to persist empty directories

Added by Tom Clegg over 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Target version:
Story points:
-
Release relationship:
Auto

Description

When saving a collection, encode an empty directory like this:

./dir/emptysubdir d41d8cd98f00b204e9800998ecf8427e+0 0:0:\056

Subtasks 1 (0 open1 closed)

Task #14541: Review 14539-pysdk-empty-dirlResolvedLucas Di Pentima12/18/2018Actions

Related issues

Related to Arvados - Bug #14806: [crunch1] unescape filenames when parsing manifests in crunch-jobResolvedTom Clegg02/05/2019Actions
Actions #1

Updated by Tom Clegg over 5 years ago

  • Target version changed from Arvados Future Sprints to 2018-12-12 Sprint
Actions #2

Updated by Tom Morris over 5 years ago

  • Target version changed from 2018-12-12 Sprint to 2018-11-28 Sprint
Actions #3

Updated by Tom Morris over 5 years ago

  • Target version changed from 2018-11-28 Sprint to 2018-12-12 Sprint
Actions #5

Updated by Lucas Di Pentima over 5 years ago

  • Assigned To set to Lucas Di Pentima
Actions #6

Updated by Lucas Di Pentima over 5 years ago

  • Target version changed from 2018-12-12 Sprint to 2018-12-21 Sprint
Actions #7

Updated by Lucas Di Pentima over 5 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Lucas Di Pentima over 5 years ago

Updates at 56959ea8492ec4f08aa90c89a8f41e5c278e3c41 - branch 14539-pysdk-empty-dir
Test run: https://ci.curoverse.com/job/developer-run-tests/1013/

  • Persist empty subdirectories by adding an empty file named \056 (".") to the manifest.
  • Don't allow explicitly use that name on file or directory names.
  • Test fixes & additions.
Actions #10

Updated by Tom Clegg over 5 years ago

The "save empty dir" part LGTM but I would like the test to include an empty directory whose name needs escaping, like "/foo bar/baz waz".

This part doesn't seem right. (Matching a specific encoding like \056 is a clue encoding/decoding is not being done properly.)

diff --git a/sdk/python/arvados/collection.py b/sdk/python/arvados/collection.py
index 627f0346d..c2517c618 100644
--- a/sdk/python/arvados/collection.py
+++ b/sdk/python/arvados/collection.py
@@ -600,6 +600,9 @@ class RichCollectionBase(CollectionBase):

         pathcomponents = path.split("/", 1)
         if pathcomponents[0]:
+            # Don't allow naming files/dirs \\056
+            if pathcomponents[0] == "\\056":
+                raise IOError(errno.EINVAL, "Invalid name", pathcomponents[0])
             item = self._items.get(pathcomponents[0])
             if len(pathcomponents) == 1:
                 if item is None:

Creating files and directories named '\056' should work fine. The manifest would look like this:

./\134056 d41d8cd98f00b204e9800998ecf8427e+0 0:0:\134056

The specific case of '\056' might be rare enough not to care about, but this seems to be revealing that something isn't escaping/unescaping backslashes properly. Indeed, a file named \040 seems to be written as \040 and then read back as " ", and this causes arv-mount to see its own updates as conflicts:

tomclegg@shell.4xphq:~/keepw/home/test$ touch '\040'
tomclegg@shell.4xphq:~/keepw/home/test$ ls -l
total 1
-rwxrwxrwx 1 tomclegg tomclegg 0 Dec 18 16:35 \040
-rwxrwxrwx 1 tomclegg tomclegg 0 Dec 18 16:35  ~20181218-163603~conflict~
tomclegg@shell.4xphq:~/keepw/home/test$ touch '\040'
tomclegg@shell.4xphq:~/keepw/home/test$ touch '\040'
tomclegg@shell.4xphq:~/keepw/home/test$ ls -l
total 2
-rwxrwxrwx 1 tomclegg tomclegg 0 Dec 18 16:35 \040
-rwxrwxrwx 1 tomclegg tomclegg 0 Dec 18 16:35  ~20181218-163603~conflict~
-rwxrwxrwx 1 tomclegg tomclegg 0 Dec 18 16:35  ~20181218-163644~conflict~
-rwxrwxrwx 1 tomclegg tomclegg 0 Dec 18 16:35  ~20181218-163656~conflict~

We should fix that instead of disallowing files/dirs named \056.

Actions #11

Updated by Lucas Di Pentima over 5 years ago

Ok, so after lots of testing and hair pulling, I think I got it to work. Squashed the many test commits by using rebase -i:

Update at 13e7ad8135a0bafc3d1d225ff7e4c62de4f3b43f
Test run: https://ci.curoverse.com/job/developer-run-tests/1020/

  • Moved the escaping to _normalize_stream(), doing it on find_or_create() was not right.
  • Removed code prohibiting to use '\056' as a file/dir name.
  • Updated tests.
Actions #13

Updated by Lucas Di Pentima over 5 years ago

  • Target version changed from 2018-12-21 Sprint to 2019-01-16 Sprint
Actions #14

Updated by Tom Clegg over 5 years ago

LGTM, thanks.

A couple of follow-ups, though:

This comment in FuseRmTest is now obsolete:

         # Can't have empty directories :-( so manifest will be empty.

Along with space and backslash, are there other chars that are legal in filenames but would be misinterpreted in a manifest? I suspect newline needs to be escaped, and we should probably do tab too (even if tab works unescaped, it could be distracting while troubleshooting).

Actions #15

Updated by Lucas Di Pentima over 5 years ago

Update at 735143cbf
Test run: https://ci.curoverse.com/job/developer-run-tests/1032/

Added escaping for \n and \t on file and stream names. Just in case I tested the use of : on file names and it seemed to me that it doesn't need escaping, no issues observed at the PySDK level.

Actions #16

Updated by Tom Clegg over 5 years ago

escape() should always escape backslash as \134, even when it isn't part of a valid escape sequence. This implementation encodes filename r'\400' as r'0:0:\400', which contradicts Keep manifest format.

Although the Python SDK can read unescaped colons in filenames, escaping them seems like a good idea. It is explicitly allowed by Keep manifest format.

By the same logic, rather than blacklisting chars, we should probably escape ":", space, and all non-printable chars (something like [:\000- ]). That would take care of \r, nul, etc.

Actions #17

Updated by Lucas Di Pentima over 5 years ago

  • Target version changed from 2019-01-16 Sprint to 2019-01-30 Sprint
Actions #18

Updated by Lucas Di Pentima over 5 years ago

Updates at 8b2eb7d1f
Test run: https://ci.curoverse.com/job/developer-run-tests/1033/

Fixes literal backslash escaping, enhances other special chars escaping & updates test.

Actions #19

Updated by Tom Clegg over 5 years ago

LGTM, thanks.

I might be getting hung up on this tangential issue, but I'll ask anyway: was there any particular reason to escape only [\t\n\r: ] but not other control chars like [:\000-\040]?

Actions #20

Updated by Lucas Di Pentima over 5 years ago

Updates at a974fc22e
Test run: https://ci.curoverse.com/job/developer-run-tests/1034/

Sorry Tom, I've now updated escape() to cover all the special chars between \000 and \040, and also ':'. Updated test too.

Actions #21

Updated by Lucas Di Pentima over 5 years ago

Removed useless comments and simplified the regex on c5e7cbef5
Test run: https://ci.curoverse.com/job/developer-run-tests/1035/

Actions #22

Updated by Tom Clegg over 5 years ago

LGTM, thanks!

Actions #23

Updated by Lucas Di Pentima over 5 years ago

  • Status changed from In Progress to Resolved
Actions #24

Updated by Peter Amstutz about 5 years ago

  • Status changed from Resolved to Feedback
  • Target version changed from 2019-01-30 Sprint to 2019-02-13 Sprint
Actions #25

Updated by Peter Amstutz about 5 years ago

crunch-job (crunchv1) was broken by these changes because manifests that previously looked like

0:525929984:sha256:6289c13e51744248b713b5d124c6148a5084544093b23996103b430fb2af7c7a.tar

Now have ":" quoted as \072

0:525929984:sha256\0726289c13e51744248b713b5d124c6148a5084544093b23996103b430fb2af7c7a.tar

As a result, it wouldn't match crunch-jobs's regular expression for picking out the image id.

Here's a quick and dirty crunch-job fix:

14539-fix-crunch-job @ ade82492825d6b78e2da35822e80f79a03e6ea67

Actions #26

Updated by Tom Clegg about 5 years ago

I get trying to do the absolute minimum, but maybe generic unescaping is easy enough that it doesn't need a special-case fix?

$filename =~ s/\\([0-3][0-7][0-7])/chr(oct($1))/ge;
Actions #27

Updated by Tom Clegg about 5 years ago

  • Related to Bug #14806: [crunch1] unescape filenames when parsing manifests in crunch-job added
Actions #28

Updated by Tom Clegg about 5 years ago

  • Status changed from Feedback to Resolved
Actions #29

Updated by Tom Morris about 5 years ago

  • Release set to 21
Actions #30

Updated by Tom Morris about 5 years ago

  • Release deleted (21)
Actions #31

Updated by Tom Morris about 5 years ago

  • Release set to 15
Actions

Also available in: Atom PDF