Project

General

Profile

Actions

Feature #12430

closed

Crunch2 limit output collection to glob patterns

Added by Peter Amstutz about 7 years ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Story points:
-
Release:
Release relationship:
Auto

Description

The current behavior for crunch-run is to upload all files in the output directory. This sometimes results in temporary files being uploaded that are not intended to be part of the output. Propose adding an "output_glob" field which is an array of filenames or glob patterns specifying which files and directories should be uploaded.

Specifically:

  • output_glob takes an array of strings.
  • If empty, fall back to default behavior (capture entire output).
  • Only basic Unix globs with ? and * wildcards only.
  • The output only includes paths that match at least one pattern in output_glob.
  • Patterns match both files and directories.
  • Directory match means capture the directory and everything inside it.
  • Pattern can include slashes to capture items in subdirectories. This means parent directories in the path are included in output but should only contain pattern matched items
  • Items are captured in place, this feature does not include rearranging files.
  • output_glob affects container reuse. output_glob must match for container reuse. Although, if we wanted to be clever, we could reuse containers where the output_glob pattern is a superset of the output_glob that we are asking for (maybe a simple version like empty [] for default behavior, or matches all ["*"]).

This feature should work for local output directory (by controlling which files are uploaded) and for the temporary collection directory (by controlling which files are propagated to the final collection). The output_glob should also apply when deciding whether to include items pre-populated in the output directory that are specified in 'mounts'.

I'm pretty sure we don't support updating an existing collection in "mounts" so we don't have to worry about that. Crunch always creates a new collection as output. We should confirm/test for that.

Examples:

Directory listing:

foo
bar
baz/quux
baz/parent1/item1

output_glob: ["foo"]
Captures:
foo

output_glob: ["f*"]
Captures:
foo

output_glob: ["f*", "b*"]
Captures:
foo
bar
baz/quux
baz/parent1/item1

output_glob: ["ba?"]
Captures:
bar
baz/quux
baz/parent1/item1

output_glob: ["ba*"]
Captures:
bar
baz/quux
baz/parent1/item1

output_glob: ["baz"]
Captures:
baz/quux
baz/parent1/item1

output_glob: ["baz/*"]
Captures:
baz/quux
baz/parent1/item1

output_glob: ["baz/parent1"]
Captures:
baz/parent1/item1

output_glob: ["baz/p*"]
Captures:
baz/parent1/item1

output_glob: ["baz/parent1/item1"]
Captures:
baz/parent1/item1

output_glob: ["quux"]
Captures:

output_glob: ["*/quux"]
Captures:
baz/quux


Subtasks 1 (0 open1 closed)

Task #21343: Review 12430-output-globResolvedPeter Amstutz05/02/2024Actions

Related issues

Blocks Arvados - Feature #9964: arvados-cwl-runner limits output data to keep using output_globResolvedPeter AmstutzActions
Actions

Also available in: Atom PDF