Feature #12430
open
Crunch2 limit output collection to glob patterns
Description
The current behavior for crunch-run is to upload all files in the output directory. This sometimes results in temporary files being uploaded that are not intended to be part of the output. Propose adding an "output_glob" field which is an array of filenames or glob patterns specifying which files and directories should be uploaded.
Specifically:
- output_glob takes an array of strings.
- If empty, fall back to default behavior (capture entire output).
- Only basic Unix globs with ? and * wildcards are supported.
- The output only includes paths that match at least one pattern in output_glob.
- Patterns match both files and directories.
- Directory match means capture the directory and everything inside it.
- Pattern can include slashes to capture items in subdirectories. This means parent directories in the path are included in output but should only contain pattern matched items.
- Items are captured in place; this feature does not include rearranging files.
- output_glob affects container reuse. output_glob must match for container reuse. Although, if we wanted to be clever, we could reuse containers where the output_glob pattern is a superset of the output_glob that we are asking for (maybe a simple version like empty [] for default behavior, or matches all ["*"]).
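The container reuse rule could be sketched roughly as follows. This is only an illustration of the "simple version" of the superset idea, not how the API server decides reuse: the function name can_reuse_for_output_glob is hypothetical, and treating output_glob as an unordered set of patterns is an assumption.

```python
def can_reuse_for_output_glob(existing_output_glob, requested_output_glob):
    """Decide whether an existing container's output_glob is compatible.

    Hypothetical helper. Assumption: pattern order and duplicates do not
    matter, so the two lists are compared as sets.
    """
    existing = set(existing_output_glob)
    requested = set(requested_output_glob)

    # Baseline rule from the proposal: output_glob must match.
    if existing == requested:
        return True

    # "Clever" simple version: an existing container that captured
    # everything (empty list, or a list containing "*") is a superset
    # of whatever is being requested.
    return not existing or "*" in existing
```

Note that reusing a superset container would yield an output containing more files than the request asked for, so a caller would still have to decide whether that is acceptable.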
This feature should work for local output directory (by controlling which files are uploaded) and for the temporary collection directory (by controlling which files are propagated to the final collection). The output_glob should also apply when deciding whether to include items pre-populated in the output directory that are specified in 'mounts'.
I'm pretty sure we don't support updating an existing collection in "mounts" so we don't have to worry about that. Crunch always creates a new collection as output. We should confirm/test for that.
Examples:
Directory listing:
foo
bar
baz/quux
baz/parent1/item1
output_glob: ["foo"]
Captures:
foo
output_glob: ["f*"]
Captures:
foo
output_glob: ["f*", "b*"]
Captures:
foo
bar
baz/quux
baz/parent1/item1
output_glob: ["ba?"]
Captures:
bar
baz/quux
baz/parent1/item1
output_glob: ["ba*"]
Captures:
bar
baz/quux
baz/parent1/item1
output_glob: ["baz"]
Captures:
baz/quux
baz/parent1/item1
output_glob: ["baz/*"]
Captures:
baz/quux
baz/parent1/item1
output_glob: ["baz/parent1"]
Captures:
baz/parent1/item1
output_glob: ["baz/p*"]
Captures:
baz/parent1/item1
output_glob: ["baz/parent1/item1"]
Captures:
baz/parent1/item1
output_glob: ["quux"]
Captures: (nothing)
output_glob: ["*/quux"]
Captures:
baz/quux
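The matching rules and examples above can be modelled with a short sketch. This is one possible reading of the proposal, not crunch-run's implementation: the names path_matches and capture_paths are hypothetical, and the sketch assumes a pattern is compared component by component against the leading components of each path, so that a match on a directory captures everything beneath it.

```python
import re


def _component_regex(pattern_component):
    """Translate one path component using only the ? and * wildcards."""
    parts = []
    for ch in pattern_component:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def path_matches(path, pattern):
    """True if the pattern matches the path itself or one of its parent directories."""
    path_parts = path.strip("/").split("/")
    pattern_parts = pattern.strip("/").split("/")
    if len(pattern_parts) > len(path_parts):
        return False
    # Compare the pattern against the leading components of the path.
    # Wildcards never cross a "/" because components are split first.
    return all(
        _component_regex(glob_part).fullmatch(path_part)
        for glob_part, path_part in zip(pattern_parts, path_parts)
    )


def capture_paths(paths, output_glob):
    """Return the paths that would be captured, in their original order."""
    if not output_glob:
        # Empty list: default behavior, capture the entire output.
        return list(paths)
    return [p for p in paths if any(path_matches(p, g) for g in output_glob)]
```

With the directory listing from the examples, capture_paths(listing, ["baz/p*"]) returns only baz/parent1/item1, and capture_paths(listing, ["quux"]) returns an empty list because the pattern is anchored at the top of the output directory.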
Updated by Tom Clegg over 6 years ago
I'm not keen on this feature. It seems to creep in an awkward direction:
1. output everything in this dir
2. output everything in this dir that matches this glob
3. output everything in this dir that matches any of these globs
4. output everything in this dir that matches any of these globs, but not this glob
5. output everything in this dir that matches any of these globs, and apply this path translation
Ideally this can all be done inside the container instead, using the shell or some other programming language of your choice. You could also add a subsequent step to the workflow that rearranges/extracts the desired files (a useful pattern for other situations too, like improving container reuse in downstream work that doesn't need to see the entire output).
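For comparison, the in-container alternative might look something like the sketch below: the tool writes into a scratch directory and a final step moves only the wanted items into the real output directory. The function name stage_outputs and the directory layout are assumptions made for illustration. Note that pathlib's glob also accepts ** and [] patterns, which is broader than the ? and * subset discussed in this ticket.

```python
import shutil
from pathlib import Path


def stage_outputs(workdir, outdir, patterns):
    """Move items matching any pattern from workdir into outdir, keeping relative paths.

    Hypothetical helper, meant to run as the last step inside the container.
    Returns the relative paths that were moved, in the order they were staged.
    """
    workdir, outdir = Path(workdir), Path(outdir)
    staged = []
    for pattern in patterns:
        # Materialize the matches first, since moving files changes the tree.
        for src in sorted(workdir.glob(pattern)):
            rel = src.relative_to(workdir)
            dest = outdir / rel
            if dest.exists():
                continue  # already staged by an earlier pattern
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dest))
            staged.append(rel.as_posix())
    return staged
```

Anything left behind in the scratch directory is simply never uploaded, at the cost of every tool wrapper having to carry this extra step.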
Updated by Peter Amstutz over 6 years ago
Tom Clegg wrote:
I'm not keen on this feature. It seems to creep in an awkward direction:
1. output everything in this dir
2. output everything in this dir that matches this glob
3. output everything in this dir that matches any of these globs
4. output everything in this dir that matches any of these globs, but not this glob
5. output everything in this dir that matches any of these globs, and apply this path translation
Nobody is asking for 4 and 5.
Ideally this can all be done inside the container instead, using the shell or some other programming language of your choice. You could also add a subsequent step to the workflow that rearranges/extracts the desired files (a useful pattern for other situations too, like improving container reuse in downstream work that doesn't need to see the entire output).
The client for this feature is arvados-cwl-runner (because output globs are defined in the tool wrapper), not individual tools. The specific problem it is intended to solve is programs that produce extra output that we don't want to upload, but that gets uploaded anyway. The obvious solution is to have some way to specify what should and should not be uploaded.
Updated by Tom Clegg about 6 years ago
I agree arvados-cwl-runner's needs are important but I would still prefer to find a way to accommodate them without feeding the "container request includes a mini-language for munging inputs and outputs in various ways" pattern.
Updated by Peter Amstutz 10 months ago
- Release deleted (60)
- Target version set to Future
Updated by Peter Amstutz 10 months ago
- Category set to Crunch
5 years later, this is still a problem. Users write code that leaves a bunch of stuff in the working directory and expect that only the 1 file that they want for output should be uploaded. We need to meet people where they are.
The proposed feature is to only capture output that matches any item in a list of specified glob patterns, aligned with how CWL output patterns work.
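As a rough illustration of that alignment, a client such as arvados-cwl-runner could gather the glob patterns already declared in a CWL tool's outputBinding entries and pass them along as the proposed output_glob list. The sketch below is not arvados-cwl-runner code. The function name collect_output_globs is hypothetical, it assumes the tool document has been parsed into a dict with outputs in list form, and it does not handle glob values that are CWL expressions.

```python
def collect_output_globs(tool):
    """Collect the distinct outputBinding.glob patterns from a parsed CWL tool.

    Hypothetical helper. Assumes tool["outputs"] is a list of dicts and that
    each glob is a plain string or a list of plain strings.
    """
    globs = []
    for output in tool.get("outputs", []):
        binding = output.get("outputBinding") or {}
        glob = binding.get("glob")
        if glob is None:
            continue  # e.g. outputs with no glob binding
        patterns = [glob] if isinstance(glob, str) else list(glob)
        for pattern in patterns:
            if pattern not in globs:
                globs.append(pattern)
    return globs
```

A real client would also need to account for anything else CWL collects from the output directory, such as secondary files, before narrowing what gets uploaded.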
Updated by Peter Amstutz 10 months ago
- Related to deleted (Feature #9964: arvados-cwl-runner limits output data to keep using output_glob)
Updated by Peter Amstutz 10 months ago
- Blocks Feature #9964: arvados-cwl-runner limits output data to keep using output_glob added
Updated by Peter Amstutz 5 months ago
- Target version changed from Future to Development 2024-01-17 sprint
Updated by Peter Amstutz 5 months ago
- Target version changed from Development 2024-01-17 sprint to Development 2024-01-03 sprint
Updated by Tom Clegg 4 months ago
I think this is fine if we can draw the line at "list of globs to include" (e.g., no "exclude" feature) and specify the globs we accept (e.g., gitignore style).
Feature should come with a test confirming that the following situation can't happen:
- cr mounts an existing collection in read+write mode at "/mnt/foo"
- cr output directory is "/mnt/foo" or "/mnt/foo/bar"
- cr output glob is "*.txt"
- crunch-run removes files from the original collection
Updated by Peter Amstutz 4 months ago
- Target version changed from Development 2024-01-03 sprint to Development 2024-01-17 sprint
Updated by Peter Amstutz 4 months ago
- Target version changed from Development 2024-01-17 sprint to Development 2024-01-31 sprint
Updated by Peter Amstutz 3 months ago
- Target version changed from Development 2024-01-31 sprint to Development 2024-01-17 sprint
Updated by Peter Amstutz 2 months ago
- Target version changed from Development 2024-01-17 sprint to Development 2024-01-31 sprint
Updated by Peter Amstutz about 2 months ago
- Target version changed from Development 2024-01-31 sprint to Development 2024-02-14 sprint
- Assigned To deleted (Alex Coleman)
Updated by Peter Amstutz about 2 months ago
- Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Updated by Peter Amstutz about 1 month ago
- Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Updated by Peter Amstutz about 1 month ago
- Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Updated by Peter Amstutz 16 days ago
- Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint