Feature #12430

open

Crunch2 limit output collection to glob patterns

Added by Peter Amstutz over 5 years ago. Updated 14 days ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

The current behavior for crunch-run is to upload all files in the output directory. This sometimes results in temporary files being uploaded that are not intended to be part of the output. I propose adding an "output_glob" field: an array of filenames or glob patterns specifying which files and directories should be uploaded.
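To make the proposal concrete, here is a minimal Python sketch of the selection logic. It is an illustration only, not crunch-run's implementation (crunch-run is written in Go). The function name `select_outputs` and the matching rules are assumptions: patterns are matched component by component against paths relative to the output directory, `*` does not cross a `/`, and a pattern that matches a directory selects everything beneath it.

```python
import fnmatch
import os


def _matches(rel_path, pattern):
    """Match a relative path against one pattern, one path component at a time."""
    path_parts = rel_path.split("/")
    pat_parts = pattern.strip("/").split("/")
    if len(pat_parts) > len(path_parts):
        return False
    # zip() stops at the end of the pattern, so a pattern that matches a
    # leading directory also selects every file below that directory.
    return all(
        fnmatch.fnmatchcase(part, pat)
        for part, pat in zip(path_parts, pat_parts)
    )


def select_outputs(output_dir, output_glob):
    """Return sorted relative paths of files matching any pattern in output_glob.

    Hypothetical helper: the name and semantics are assumptions for
    illustration, not an existing Arvados API.
    """
    selected = []
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), output_dir)
            rel = rel.replace(os.sep, "/")
            if any(_matches(rel, pattern) for pattern in output_glob):
                selected.append(rel)
    return sorted(selected)
```

Under these assumed rules, calling `select_outputs(outdir, ["*.bam", "results"])` would keep top-level `.bam` files and the whole `results` directory, and skip everything else in the output directory.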


Related issues

Blocks Arvados - Feature #9964: arvados-cwl-runner limits output data to keep using output_glob (New, 09/07/2016)

Actions #1

Updated by Peter Amstutz over 5 years ago

  • Description updated
Actions #2

Updated by Tom Clegg over 5 years ago

I'm not keen on this feature. It seems to creep in an awkward direction:
  1. output everything in this dir
  2. output everything in this dir that matches this glob
  3. output everything in this dir that matches any of these globs
  4. output everything in this dir that matches any of these globs, but not this glob
  5. output everything in this dir that matches any of these globs, and apply this path translation

Ideally this can all be done inside the container instead, using the shell or some other programming language of your choice. You could also add a subsequent step to the workflow that rearranges/extracts the desired files (a useful pattern for other situations too, like improving container reuse in downstream work that doesn't need to see the entire output).
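One way the in-container approach described above might look is a small cleanup step run as the last command in the container, so that only the wanted files remain in the output directory when crunch-run collects it. This is a sketch under assumptions: the function name `prune_outputs` is hypothetical, and patterns are matched with Python's `fnmatch` against the full relative path, where `*` also matches `/`.

```python
import fnmatch
import os


def prune_outputs(output_dir, keep_patterns):
    """Delete files under output_dir that match none of keep_patterns.

    Hypothetical helper for illustration. Patterns are matched against the
    path relative to output_dir; with fnmatch, "*" also matches "/".
    Returns the sorted list of relative paths that were removed.
    """
    removed = []
    # Walk bottom-up so that directories emptied by the pruning can be
    # removed after their contents.
    for root, _dirs, files in os.walk(output_dir, topdown=False):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, output_dir).replace(os.sep, "/")
            if not any(fnmatch.fnmatchcase(rel, p) for p in keep_patterns):
                os.remove(full)
                removed.append(rel)
        if root != output_dir and not os.listdir(root):
            os.rmdir(root)
    return sorted(removed)
```

The trade-off is that every tool wrapper has to include and call such a step, whereas a request-level field would apply the same filtering without touching the container's command.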

Actions #3

Updated by Peter Amstutz over 5 years ago

Tom Clegg wrote:

> I'm not keen on this feature. It seems to creep in an awkward direction:
>   1. output everything in this dir
>   2. output everything in this dir that matches this glob
>   3. output everything in this dir that matches any of these globs
>   4. output everything in this dir that matches any of these globs, but not this glob
>   5. output everything in this dir that matches any of these globs, and apply this path translation

Nobody is asking for 4 and 5.

> Ideally this can all be done inside the container instead, using the shell or some other programming language of your choice. You could also add a subsequent step to the workflow that rearranges/extracts the desired files (a useful pattern for other situations too, like improving container reuse in downstream work that doesn't need to see the entire output).

The client for this feature is arvados-cwl-runner (because output globs are defined in the tool wrapper), not individual tools. The specific problem it is intended to solve is programs that produce extra output that we don't want to upload, but that gets uploaded anyway. The obvious solution is to have some way to specify what should and should not be uploaded.

Actions #4

Updated by Tom Clegg over 5 years ago

I agree arvados-cwl-runner's needs are important but I would still prefer to find a way to accommodate them without feeding the "container request includes a mini-language for munging inputs and outputs in various ways" pattern.

Actions #5

Updated by Peter Amstutz 4 months ago

  • Release set to 60
Actions #6

Updated by Peter Amstutz 14 days ago

  • Release deleted (60)
  • Target version set to To be groomed
Actions #7

Updated by Peter Amstutz 14 days ago

  • Category set to Crunch

Five years later, this is still a problem. Users write code that leaves a bunch of stuff in the working directory and expect that only the one file that they want for output will be uploaded. We need to meet people where they are.

The proposed feature is to only capture output that matches any item in a list of specified glob patterns, aligned with how CWL output patterns work.
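As a rough illustration of how a client such as arvados-cwl-runner could derive that list, the sketch below gathers the literal `outputBinding.glob` values from a CWL CommandLineTool that has already been loaded into a Python dict. The function name `collect_output_globs` is hypothetical, and this is not how arvados-cwl-runner is known to work. Globs written as CWL expressions (`$(...)` or `${...}`) cannot be resolved without evaluating them, so this sketch rejects them rather than guessing.

```python
def collect_output_globs(tool):
    """Collect literal outputBinding.glob patterns from a CWL tool dict.

    Hypothetical helper for illustration. Raises ValueError when a glob is
    a CWL expression, since its value is only known at evaluation time.
    """
    outputs = tool.get("outputs", [])
    # CWL allows "outputs" to be a list of entries or a map keyed by output id.
    entries = outputs.values() if isinstance(outputs, dict) else outputs
    globs = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue  # shorthand form such as {"out": "stdout"} has no binding
        binding = entry.get("outputBinding") or {}
        glob = binding.get("glob")
        if glob is None:
            continue
        # A glob may be a single string or an array of strings.
        patterns = [glob] if isinstance(glob, str) else list(glob)
        for pattern in patterns:
            if "$(" in pattern or "${" in pattern:
                raise ValueError("glob is an expression: %r" % pattern)
            if pattern not in globs:
                globs.append(pattern)
    return globs
```

The resulting list is what would be passed as the proposed `output_glob` value on the container request.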

Actions #8

Updated by Peter Amstutz 14 days ago

  • Related to deleted (Feature #9964: arvados-cwl-runner limits output data to keep using output_glob)
Actions #9

Updated by Peter Amstutz 14 days ago

  • Blocks Feature #9964: arvados-cwl-runner limits output data to keep using output_glob added