Bug #11579

arv-put doesn't traverse symlinked directories

Added by Lucas Di Pentima over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
04/27/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

The new version of arv-put is silently ignoring directory symlinks while the previous version didn't.
  • Allow symlink traversal by default, adding a flag to disable this behaviour.
  • Add a check so that the command won't go into an infinite uploads.
  • Add test coverage.

Subtasks

Task #11589: Review 11579-arvput-follow-symlinksResolvedPeter Amstutz

Associated revisions

Revision 613155d5
Added by Lucas Di Pentima over 3 years ago

Merge branch '11579-arvput-follow-symlinks'
Closes #11579

History

#1 Updated by Lucas Di Pentima over 3 years ago

  • Status changed from New to In Progress

#2 Updated by Lucas Di Pentima over 3 years ago

Changes at: e289b9ad5 - branch 11579-arvput-follow-symlinks
Test run: https://ci.curoverse.com/job/developer-run-tests/261/

Reverted default behavior so that arv-put follow symlinked directories.
Added flag --no-follow-links to allow the user avoid this, if needed.
Symlinked dirs will only be traversed once per upload, warning the user when a directory is linked more than once.

#3 Updated by Peter Amstutz over 3 years ago

While testing this, I tried to upload a non-existing directory, and the backtrace is pretty terrible, this needs to be cleaned up:

$ arv-put  stuff
2017-04-28 10:48:30 arvados.arv_put[1041] INFO: Creating new cache file at /home/peter/.cache/arvados/arv-put/11f5dc3d482fcb99c82089cbb6dbe037
0 2017-04-28 10:48:30 arvados.arv_put[1041] WARNING: Abnormal termination:
Traceback (most recent call last):
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/arvados_python_client-0.1.20170427201553-py2.7.egg/arvados/commands/put.py", line 470, in start
    self.filename or os.path.basename(path))
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/arvados_python_client-0.1.20170427201553-py2.7.egg/arvados/commands/put.py", line 611, in _check_file
    'mtime': os.path.getmtime(source),
  File "/home/peter/work/scripts/venv/lib/python2.7/genericpath.py", line 54, in getmtime
    return os.stat(filename).st_mtime
OSError: [Errno 2] No such file or directory: '/home/peter/tmp/stuff/stuff'

Traceback (most recent call last):
  File "/home/peter/work/scripts/venv/bin/arv-put", line 4, in <module>
    __import__('pkg_resources').run_script('arvados-python-client==0.1.20170427201553', 'arv-put')
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 738, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1499, in run_script
    exec(code, namespace, namespace)
  File "/home/peter/work/scripts/venv/lib/python2.7/site-packages/arvados_python_client-0.1.20170427201553-py2.7.egg/EGG-INFO/scripts/arv-put", line 4, in <module>
    main()
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/arvados_python_client-0.1.20170427201553-py2.7.egg/arvados/commands/put.py", line 992, in main
    writer.start(save_collection=not(args.stream or args.raw))
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/arvados_python_client-0.1.20170427201553-py2.7.egg/arvados/commands/put.py", line 470, in start
    self.filename or os.path.basename(path))
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/arvados_python_client-0.1.20170427201553-py2.7.egg/arvados/commands/put.py", line 611, in _check_file
    'mtime': os.path.getmtime(source),
  File "/home/peter/work/scripts/venv/lib/python2.7/genericpath.py", line 54, in getmtime
    return os.stat(filename).st_mtime
OSError: [Errno 2] No such file or directory: '/home/peter/tmp/stuff/stuff'

#4 Updated by Peter Amstutz over 3 years ago

Here's my (evil) test case:

b1/
b1/b
b2 -> b1/
b3/
b3/b3
b3/b -> ../b4/
b4/
b4/b -> ../b3/
b4/b4
b5 -> b2/
bar
baz -> bar
quux -> baz

A few edge cases:

--no-follow-links still follows links to files (in the example "baz" and "quux" are uploaded).

It uploads both "b1" and the symlink-to-b1 "b2". However it refuses to upload "b5" which is a symlink-to-symlink.

I made a new directory:

b1 -> ../stuff/b1/
b5 -> ../stuff/b5/

This resulting in the error "Skipping '/home/peter/tmp/stuff2/b1' symlink to directory '/home/peter/tmp/stuff/b1' because it was already uploaded"

It only created "b5".

Then I tried

b0 -> ../stuff/b5/
b1 -> ../stuff/b1/

and got "Skipping '/home/peter/tmp/stuff2/b0' symlink to directory '/home/peter/tmp/stuff/b1' because it was already uploaded"

In this case it only created "b1".

The behavior seems sensitive to sort order.

I believe the correct behavior should be copy everything as if they were regular files and directories, except when symlinks create a infinite loop. When you detect that something is a symlink to something that has already been uploaded, you can make a copy within the Collection (that should go for both file and directories).

#5 Updated by Peter Amstutz over 3 years ago

Suggest maybe you keep a mapping of local file path to its location in the collection, after calling realpath(), check if the file has already been uploaded and if so copy it within the collection.

You might need a slightly different strategy for detecting symlink cycles.

#6 Updated by Lucas Di Pentima over 3 years ago

Updates at 4d012b23a
Test run: https://ci.curoverse.com/job/developer-run-tests/266/

As discussed on the chat, I focused on fixing the regression, that is, the fact that this new arv-put didn't followed symlinked directories.
Fixed faulty behavior when asking to not follow symlinks, it is now ignoring both files and directories.
Also Peter asked me to enhance the error message when passing a non existant path, so now instead of printing a huge backtrace, it will log a proper message.

After this is merged, I'll continue adding the remaining features and fixes related to symlink recursion and duplication.

#7 Updated by Peter Amstutz over 3 years ago

11579-arvput-follow-symlinks @ 4d012b23a4ac88f433986054fc0085ee6714b5b3

The "file not found" message is still printing a backtrace?

2017-05-02 09:46:36 arvados.arv_put[22714] INFO: Creating new cache file at /home/peter/.cache/arvados/arv-put/7ca31b0f6a86bfebd692b4b00d1192ed
0 2017-05-02 09:46:36 arvados.arv_put[22714] WARNING: Abnormal termination:
Traceback (most recent call last):
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/arvados_python_client-0.1.20170501210502-py2.7.egg/arvados/commands/put.py", line 438, in start
    raise PathDoesNotExistError("file or directory '{}' does not exist.".format(path))
PathDoesNotExistError: file or directory 'stuff3' does not exist.

2017-05-02 09:46:36 arvados.arv_put[22714] ERROR: arv-put: file or directory 'stuff3' does not exist.

Rest LGTM

#8 Updated by Lucas Di Pentima over 3 years ago

Updated code & merged master at fe45b1b66c730f2546d78a7899375707c0816518
Running tests before merging to master: https://ci.curoverse.com/job/developer-run-tests/267/

#9 Updated by Lucas Di Pentima over 3 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:613155d587da60dbe04c7635649b1f3694938adc.

Also available in: Atom PDF