Feature #12430

Crunch2 limit output collection to glob patterns

Added by Peter Amstutz 2 months ago. Updated about 1 month ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

The current behavior for crunch-run is to upload all files in the output directory. This sometimes results in temporary files being uploaded that are not intended to be part of the output. Propose adding an "output_glob" field which is an array of filenames or glob patterns specifying which files and directories should be uploaded.


Related issues

Related to Arvados - Bug #9964: [CWL][Crunch2][Crunch] crunchrunner should use CWL globs to output data to keepNew2016-09-07

History

#1 Updated by Peter Amstutz 2 months ago

  • Description updated (diff)

#2 Updated by Tom Clegg about 1 month ago

I'm not keen on this feature. It seems to creep in an awkward direction:
  1. output everything in this dir
  2. output everything in this dir that matches this glob
  3. output everything in this dir that matches any of these globs
  4. output everything in this dir that matches any of these globs, but not this glob
  5. output everything in this dir that matches any of these globs, and apply this path translation

Ideally this can all be done inside the container instead, using the shell or some other programming language of your choice. You could also add a subsequent step to the workflow that rearranges/extracts the desired files (a useful pattern for other situations too, like improving container reuse in downstream work that doesn't need to see the entire output).

#3 Updated by Peter Amstutz about 1 month ago

Tom Clegg wrote:

I'm not keen on this feature. It seems to creep in an awkward direction:
  1. output everything in this dir
  2. output everything in this dir that matches this glob
  3. output everything in this dir that matches any of these globs
  4. output everything in this dir that matches any of these globs, but not this glob
  5. output everything in this dir that matches any of these globs, and apply this path translation

Nobody is asking for 4 and 5.

Ideally this can all be done inside the container instead, using the shell or some other programming language of your choice. You could also add a subsequent step to the workflow that rearranges/extracts the desired files (a useful pattern for other situations too, like improving container reuse in downstream work that doesn't need to see the entire output).

The client for this feature is arvados-cwl-runner (because output globs defined in the tool wrapper), not individual tools. The specific problem is intended to solve is programs that produce extra output that we don't want to upload, but gets uploaded anyway. The obvious solution is to have some way to specify what should and should not be uploaded.

Also available in: Atom PDF