Bug #14539

[SDKs] [arv-mount] Use "." placeholder to persist empty directories

Added by Tom Clegg about 1 year ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Target version:
Start date:
12/18/2018
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

When saving a collection, encode an empty directory like this:

./dir/emptysubdir d41d8cd98f00b204e9800998ecf8427e+0 0:0:\056

Subtasks

Task #14541: Review 14539-pysdk-empty-dirlResolvedLucas Di Pentima


Related issues

Related to Arvados - Bug #14806: [crunch1] unescape filenames when parsing manifests in crunch-jobResolved02/05/2019

Associated revisions

Revision 4620478d
Added by Lucas Di Pentima 11 months ago

Merge branch '14539-pysdk-empty-dir'
Closes #14539

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

History

#1 Updated by Tom Clegg about 1 year ago

  • Target version changed from Arvados Future Sprints to 2018-12-12 Sprint

#2 Updated by Tom Morris about 1 year ago

  • Target version changed from 2018-12-12 Sprint to 2018-11-28 Sprint

#3 Updated by Tom Morris about 1 year ago

  • Target version changed from 2018-11-28 Sprint to 2018-12-12 Sprint

#5 Updated by Lucas Di Pentima about 1 year ago

  • Assigned To set to Lucas Di Pentima

#6 Updated by Lucas Di Pentima about 1 year ago

  • Target version changed from 2018-12-12 Sprint to 2018-12-21 Sprint

#7 Updated by Lucas Di Pentima 12 months ago

  • Status changed from New to In Progress

#8 Updated by Lucas Di Pentima 12 months ago

Updates at 56959ea8492ec4f08aa90c89a8f41e5c278e3c41 - branch 14539-pysdk-empty-dir
Test run: https://ci.curoverse.com/job/developer-run-tests/1013/

  • Persist empty subdirectories by adding an empty file named \056 (".") to the manifest.
  • Don't allow explicitly use that name on file or directory names.
  • Test fixes & additions.

#10 Updated by Tom Clegg 12 months ago

The "save empty dir" part LGTM but I would like the test to include an empty directory whose name needs escaping, like "/foo bar/baz waz".

This part doesn't seem right. (Matching a specific encoding like \056 is a clue encoding/decoding is not being done properly.)

diff --git a/sdk/python/arvados/collection.py b/sdk/python/arvados/collection.py
index 627f0346d..c2517c618 100644
--- a/sdk/python/arvados/collection.py
+++ b/sdk/python/arvados/collection.py
@@ -600,6 +600,9 @@ class RichCollectionBase(CollectionBase):

         pathcomponents = path.split("/", 1)
         if pathcomponents[0]:
+            # Don't allow naming files/dirs \\056
+            if pathcomponents[0] == "\\056":
+                raise IOError(errno.EINVAL, "Invalid name", pathcomponents[0])
             item = self._items.get(pathcomponents[0])
             if len(pathcomponents) == 1:
                 if item is None:

Creating files and directories named '\056' should work fine. The manifest would look like this:

./\134056 d41d8cd98f00b204e9800998ecf8427e+0 0:0:\134056

The specific case of '\056' might be rare enough not to care about, but this seems to be revealing that something isn't escaping/unescaping backslashes properly. Indeed, a file named \040 seems to be written as \040 and then read back as " ", and this causes arv-mount to see its own updates as conflicts:

tomclegg@shell.4xphq:~/keepw/home/test$ touch '\040'
tomclegg@shell.4xphq:~/keepw/home/test$ ls -l
total 1
-rwxrwxrwx 1 tomclegg tomclegg 0 Dec 18 16:35 \040
-rwxrwxrwx 1 tomclegg tomclegg 0 Dec 18 16:35  ~20181218-163603~conflict~
tomclegg@shell.4xphq:~/keepw/home/test$ touch '\040'
tomclegg@shell.4xphq:~/keepw/home/test$ touch '\040'
tomclegg@shell.4xphq:~/keepw/home/test$ ls -l
total 2
-rwxrwxrwx 1 tomclegg tomclegg 0 Dec 18 16:35 \040
-rwxrwxrwx 1 tomclegg tomclegg 0 Dec 18 16:35  ~20181218-163603~conflict~
-rwxrwxrwx 1 tomclegg tomclegg 0 Dec 18 16:35  ~20181218-163644~conflict~
-rwxrwxrwx 1 tomclegg tomclegg 0 Dec 18 16:35  ~20181218-163656~conflict~

We should fix that instead of disallowing files/dirs named \056.

#11 Updated by Lucas Di Pentima 12 months ago

Ok, so after lots of testing and hair pulling, I think I got it to work. Squashed the many test commits by using rebase -i:

Update at 13e7ad8135a0bafc3d1d225ff7e4c62de4f3b43f
Test run: https://ci.curoverse.com/job/developer-run-tests/1020/

  • Moved the escaping to _normalize_stream(), doing it on find_or_create() was not right.
  • Removed code prohibiting to use '\056' as a file/dir name.
  • Updated tests.

#13 Updated by Lucas Di Pentima 11 months ago

  • Target version changed from 2018-12-21 Sprint to 2019-01-16 Sprint

#14 Updated by Tom Clegg 11 months ago

LGTM, thanks.

A couple of follow-ups, though:

This comment in FuseRmTest is now obsolete:

         # Can't have empty directories :-( so manifest will be empty.

Along with space and backslash, are there other chars that are legal in filenames but would be misinterpreted in a manifest? I suspect newline needs to be escaped, and we should probably do tab too (even if tab works unescaped, it could be distracting while troubleshooting).

#15 Updated by Lucas Di Pentima 11 months ago

Update at 735143cbf
Test run: https://ci.curoverse.com/job/developer-run-tests/1032/

Added escaping for \n and \t on file and stream names. Just in case I tested the use of : on file names and it seemed to me that it doesn't need escaping, no issues observed at the PySDK level.

#16 Updated by Tom Clegg 11 months ago

escape() should always escape backslash as \134, even when it isn't part of a valid escape sequence. This implementation encodes filename r'\400' as r'0:0:\400', which contradicts Keep manifest format.

Although the Python SDK can read unescaped colons in filenames, escaping them seems like a good idea. It is explicitly allowed by Keep manifest format.

By the same logic, rather than blacklisting chars, we should probably escape ":", space, and all non-printable chars (something like [:\000- ]). That would take care of \r, nul, etc.

#17 Updated by Lucas Di Pentima 11 months ago

  • Target version changed from 2019-01-16 Sprint to 2019-01-30 Sprint

#18 Updated by Lucas Di Pentima 11 months ago

Updates at 8b2eb7d1f
Test run: https://ci.curoverse.com/job/developer-run-tests/1033/

Fixes literal backslash escaping, enhances other special chars escaping & updates test.

#19 Updated by Tom Clegg 11 months ago

LGTM, thanks.

I might be getting hung up on this tangential issue, but I'll ask anyway: was there any particular reason to escape only [\t\n\r: ] but not other control chars like [:\000-\040]?

#20 Updated by Lucas Di Pentima 11 months ago

Updates at a974fc22e
Test run: https://ci.curoverse.com/job/developer-run-tests/1034/

Sorry Tom, I've now updated escape() to cover all the special chars between \000 and \040, and also ':'. Updated test too.

#21 Updated by Lucas Di Pentima 11 months ago

Removed useless comments and simplified the regex on c5e7cbef5
Test run: https://ci.curoverse.com/job/developer-run-tests/1035/

#22 Updated by Tom Clegg 11 months ago

LGTM, thanks!

#23 Updated by Lucas Di Pentima 11 months ago

  • Status changed from In Progress to Resolved

#24 Updated by Peter Amstutz 10 months ago

  • Status changed from Resolved to Feedback
  • Target version changed from 2019-01-30 Sprint to 2019-02-13 Sprint

#25 Updated by Peter Amstutz 10 months ago

crunch-job (crunchv1) was broken by these changes because manifests that previously looked like

0:525929984:sha256:6289c13e51744248b713b5d124c6148a5084544093b23996103b430fb2af7c7a.tar

Now have ":" quoted as \072

0:525929984:sha256\0726289c13e51744248b713b5d124c6148a5084544093b23996103b430fb2af7c7a.tar

As a result, it wouldn't match crunch-jobs's regular expression for picking out the image id.

Here's a quick and dirty crunch-job fix:

14539-fix-crunch-job @ ade82492825d6b78e2da35822e80f79a03e6ea67

#26 Updated by Tom Clegg 10 months ago

I get trying to do the absolute minimum, but maybe generic unescaping is easy enough that it doesn't need a special-case fix?

$filename =~ s/\\([0-3][0-7][0-7])/chr(oct($1))/ge;

#27 Updated by Tom Clegg 10 months ago

  • Related to Bug #14806: [crunch1] unescape filenames when parsing manifests in crunch-job added

#28 Updated by Tom Clegg 10 months ago

  • Status changed from Feedback to Resolved

#29 Updated by Tom Morris 10 months ago

  • Release set to 21

#30 Updated by Tom Morris 10 months ago

  • Release deleted (21)

#31 Updated by Tom Morris 10 months ago

  • Release set to 15

Also available in: Atom PDF