Idea #11789
closed[arv-put] add a --exclude flag to allow specific files/directories to be skipped
Description
arv-put should support a --exclude flag which accepts a glob pattern to exclude from the upload. The desired functionality is to mimic that supported by rsync --exclude. In the case where multiple flags are specified, all the patterns are excluded.
Some examples:- exclude a specific directory
--exclude Images
- exclude a specific subdirectory
--exclude 'foo/bar'
(all paths are relative) - exclude all JPGs
--exclude '*.jpg'
Updated by Tom Morris over 7 years ago
- Description updated (diff)
- Story points set to 1.0
Updated by Lucas Di Pentima over 7 years ago
- Assigned To set to Lucas Di Pentima
Updated by Lucas Di Pentima over 7 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima over 7 years ago
The cache filename is calculated from hashing some data, including those file/dir paths that the user passed to the command. I'm filtering those matching paths when doing the first byte count and final upload, and was writing the code to also filter them out when setting up the cache file name, but now I'm in doubt if that's really what we want.
I'm thinking the use case when a big upload is executed and cancelled in the middle because the user forgot to exclude some unneeded files: If we use the --exclude
flag on the cache file name calculation, it will start from scratch the upload when the pattern matches some of the provided paths.
Updated by Lucas Di Pentima over 7 years ago
Implementation at 6c1f8e1f0 - Branch 11789-arvput-exclude-flag
Test run: https://ci.curoverse.com/job/developer-run-tests/330/ (no new tests yet)
Before I finish writing the tests I would like to confirm that the behavior is what is expected. I stumbled an issue with how fnmatch
works, as it's for only matching file names, and not file paths.
If I use fnmatch.fnmatch()
with a pattern like subdir/*.jpg
against a file path, it will match subdir/file.jpg
and also subdir/subsubdir/anotherfile.jpg
and I believe this is not what we want.
So I'm using fnmatch
when a filenames type pattern is passed, and for pathnames type patterns I'm simulating what glob.glob()
does, but without scanning the filesystem, so a pattern like the one described before will work correctly, only matching *.jpg
files/dirs inside subdir/
.
If you want to exclude *.jpg
files pseudo-recursively, you can use a pattern like: subdir/*/*.jpg
to control the depth.
Updated by Peter Amstutz over 7 years ago
- Style suggestion,
bool([])
is False, so instead of this:
path_patterns = [] exclude_paths = path_patterns if len(path_patterns) > 0 else None if exclude_paths is not None: ...
You can just write this:
path_patterns = [] if path_patterns: ...
- I don't like that the "exclude" filtering is implemented twice: once in
expected_bytes_for
and again inArvPutUploadJob.start()
Could we move the for path in self.paths:
block in ArvPutUploadJob.start()
into its own method which does the work of collecting the list of files to upload, applies excludes, and computes the expected bytes? Or alternately, can the exclude logic can be refactored into a function?
ArvPutUploadJob
takesexclude={'paths': exclude_paths, 'names': exclude_names}
and then immediately unpacks it to separate variablesexclude_paths
andexclude_names
. Is there a reason not to pass them through as separate parameters?
- The help string needs to explain that it uses glob patterns, that filenames and paths have different behavior.
- I think there might be a bug in
pathname_match()
: if I have a relative pathname of "bar" and a pattern of "bar/" it looks like it will split the pattern into ["bar", ""] which will fail to match against ["bar"].
Updated by Peter Amstutz over 7 years ago
One more thought, related to my comment about pathname_match()
, I think you need to remove both '' and '.' from the pattern before checking the path, for example --exclude "./*.jpg" to exclude only the jpgs in the current directory but not all jpgs.
Updated by Lucas Di Pentima over 7 years ago
- Target version changed from 2017-06-21 sprint to 2017-07-05 sprint
Updated by Lucas Di Pentima over 7 years ago
Updates at 441ef97e93a951b349356df96d8a6ef604c6cab7
Test run: https://ci.curoverse.com/job/developer-run-tests/361/
Addressed previous suggestions and added some tests.
One corner case need solution: When the user wants to pass a pathname pattern to exclude all files from the immediate subdirectory, currently it isn't possible:
- dir1/
- file1.txt
- file2.txt
- subdir1/
- file3.txt
If the user wants to exclude all .txt files inside dir1/, should the exclude parameter be used like this?
arv-put --exclude './*.txt' dir1
Updated by Peter Amstutz over 7 years ago
Lucas Di Pentima wrote:
Updates at 441ef97e93a951b349356df96d8a6ef604c6cab7
Test run: https://ci.curoverse.com/job/developer-run-tests/361/Addressed previous suggestions and added some tests.
One corner case need solution: When the user wants to pass a pathname pattern to exclude all files from the immediate subdirectory, currently it isn't possible:
- dir1/
- file1.txt
- file2.txt
- subdir1/
- file3.txt
If the user wants to exclude all .txt files inside dir1/, should the exclude parameter be used like this?
arv-put --exclude './*.txt' dir1
Yes. Maybe this is as simple as excluding '.'?
pat = [x for x in pattern.split(os.sep) if x != '' and x != '.']
Then it'll turn ['.', '*.txt'] into ['*.txt']
This is just style, but this list/filter/lambda code gets a little bit hairy:
dirs[:] = list(filter( lambda d: not any( [pathname_match(os.path.join(root_relpath, d), pat) for pat in self.exclude_paths]), dirs))
I think this list comprehension does the same thing?
dirs[:] = [d for d in dirs if not any( pathname_match(os.path.join(root_relpath, d), pat) for pat in self.exclude_paths)]
Rest LGTM.
Updated by Peter Amstutz over 7 years ago
Wouldn't that be
arv-put --exclude './*.txt' .
Otherwise it would be
arv-put --exclude 'dir1/*.txt' dir1
Or does "dir1" not get added to the manifest? (Seems like it should...)
Also needs a test for --exclude './*.txt'
Updated by Lucas Di Pentima over 7 years ago
Updates at d53ae1523
Test run: https://ci.curoverse.com/job/developer-run-tests/371/
Cleaned up all filter(lambda x: ...)
use, replacing them with list comprehensions (as they're being somewhat left behind on Python3).
Added support for pathname exclude patterns that refers to files/dirs below a given input dir argument: We're going to use './' for this, it will differentiate --exclude 'this.file'
from --exclude './this.file'
in the sense that the former matches all files named that way, and the latter only matches this.file
files directly below the given input directories.
Updated tests & help messages.
Updated by Lucas Di Pentima over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:8a4a2ee17807f35eaff8be99cca780ef1fd0ba64.