Project

General

Profile

Actions

Feature #12430

closed

Crunch2 limit output collection to glob patterns

Added by Peter Amstutz over 7 years ago. Updated 9 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Story points:
-
Release:
Release relationship:
Auto

Description

The current behavior for crunch-run is to upload all files in the output directory. This sometimes results in temporary files being uploaded that are not intended to be part of the output. Propose adding an "output_glob" field which is an array of filenames or glob patterns specifying which files and directories should be uploaded.

Specifically:

  • output_glob takes an array of strings.
  • If empty, fall back to default behavior (capture entire output).
  • Only basic Unix globs with ? and * wildcards only.
  • The output only includes paths that match at least one pattern in output_glob.
  • Patterns match both files and directories.
  • Directory match means capture the directory and everything inside it.
  • Pattern can include slashes to capture items in subdirectories. This means parent directories in the path are included in output but should only contain pattern matched items
  • Items are captured in place, this feature does not include rearranging files.
  • output_glob affects container reuse. output_glob must match for container reuse. Although, if we wanted to be clever, we could reuse containers where the output_glob pattern is a superset of the output_glob that we are asking for (maybe a simple version like empty [] for default behavior, or matches all ["*"]).

This feature should work for local output directory (by controlling which files are uploaded) and for the temporary collection directory (by controlling which files are propagated to the final collection). The output_glob should also apply when deciding whether to include items pre-populated in the output directory that are specified in 'mounts'.

I'm pretty sure we don't support updating an existing collection in "mounts" so we don't have to worry about that. Crunch always creates a new collection as output. We should confirm/test for that.

Examples:

Directory listing:

foo
bar
baz/quux
baz/parent1/item1

output_glob: ["foo"]
Captures:
foo

output_glob: ["f*"]
Captures:
foo

output_glob: ["f*", "b*"]
Captures:
foo
bar
baz/quux
baz/parent1/item1

output_glob: ["ba?"]
Captures:
bar
baz/quux
baz/parent1/item1

output_glob: ["ba*"]
Captures:
bar
baz/quux
baz/parent1/item1

output_glob: ["baz"]
Captures:
baz/quux
baz/parent1/item1

output_glob: ["baz/*"]
Captures:
baz/quux
baz/parent1/item1

output_glob: ["baz/parent1"]
Captures:
baz/parent1/item1

output_glob: ["baz/p*"]
Captures:
baz/parent1/item1

output_glob: ["baz/parent1/item1"]
Captures:
baz/parent1/item1

output_glob: ["quux"]
Captures:

output_glob: ["*/quux"]
Captures:
baz/quux


Subtasks 1 (0 open1 closed)

Task #21343: Review 12430-output-globResolvedPeter Amstutz05/02/2024Actions

Related issues 2 (1 open1 closed)

Related to Arvados - Bug #22467: Remove unused fulltext indexIn ProgressTom CleggActions
Blocks Arvados - Feature #9964: arvados-cwl-runner limits output data to keep using output_globResolvedPeter AmstutzActions
Actions #1

Updated by Peter Amstutz over 7 years ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg about 7 years ago

I'm not keen on this feature. It seems to creep in an awkward direction:
  1. output everything in this dir
  2. output everything in this dir that matches this glob
  3. output everything in this dir that matches any of these globs
  4. output everything in this dir that matches any of these globs, but not this glob
  5. output everything in this dir that matches any of these globs, and apply this path translation

Ideally this can all be done inside the container instead, using the shell or some other programming language of your choice. You could also add a subsequent step to the workflow that rearranges/extracts the desired files (a useful pattern for other situations too, like improving container reuse in downstream work that doesn't need to see the entire output).

Actions #3

Updated by Peter Amstutz about 7 years ago

Tom Clegg wrote:

I'm not keen on this feature. It seems to creep in an awkward direction:
  1. output everything in this dir
  2. output everything in this dir that matches this glob
  3. output everything in this dir that matches any of these globs
  4. output everything in this dir that matches any of these globs, but not this glob
  5. output everything in this dir that matches any of these globs, and apply this path translation

Nobody is asking for 4 and 5.

Ideally this can all be done inside the container instead, using the shell or some other programming language of your choice. You could also add a subsequent step to the workflow that rearranges/extracts the desired files (a useful pattern for other situations too, like improving container reuse in downstream work that doesn't need to see the entire output).

The client for this feature is arvados-cwl-runner (because output globs defined in the tool wrapper), not individual tools. The specific problem is intended to solve is programs that produce extra output that we don't want to upload, but gets uploaded anyway. The obvious solution is to have some way to specify what should and should not be uploaded.

Actions #4

Updated by Tom Clegg about 7 years ago

I agree arvados-cwl-runner's needs are important but I would still prefer to find a way to accommodate them without feeding the "container request includes a mini-language for munging inputs and outputs in various ways" pattern.

Actions #5

Updated by Peter Amstutz almost 2 years ago

  • Release set to 60
Actions #6

Updated by Peter Amstutz over 1 year ago

  • Release deleted (60)
  • Target version set to Future
Actions #7

Updated by Peter Amstutz over 1 year ago

  • Category set to Crunch

5 years later, this is still a problem. Users write code that leaves a bunch of stuff in the working directory and expect that only the 1 file that the want for output should be uploaded. We need to meet people where they are.

The proposed feature is to only capture output that matches any item in a list of specified glob patterns, aligned with how CWL output patterns work.

Actions #8

Updated by Peter Amstutz over 1 year ago

  • Related to deleted (Feature #9964: arvados-cwl-runner limits output data to keep using output_glob)
Actions #9

Updated by Peter Amstutz over 1 year ago

  • Blocks Feature #9964: arvados-cwl-runner limits output data to keep using output_glob added
Actions #10

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Future to Development 2024-01-17 sprint
Actions #11

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-01-17 sprint to Development 2024-01-03 sprint
Actions #12

Updated by Tom Clegg about 1 year ago

I think this is fine if we can draw the line at "list of globs to include" (e.g., no "exclude" feature) and specify the globs we accept (e.g., gitignore style).

Feature should come with a test confirming that the following situation can't happen:
  1. cr mounts an existing collection in read+write mode at "/mnt/foo"
  2. cr output directory is "/mnt/foo" or /mnt/foo/bar"
  3. cr output glob is "*.txt"
  4. crunch-run removes files from the original collection
Actions #13

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
Actions #14

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
Actions #15

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
Actions #16

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
Actions #17

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-01-03 sprint to Development 2024-01-17 sprint
Actions #18

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
Actions #19

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-01-17 sprint to Development 2024-01-31 sprint
Actions #20

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-01-31 sprint to Development 2024-01-17 sprint
Actions #21

Updated by Peter Amstutz about 1 year ago

  • Assigned To set to Alex Coleman
Actions #22

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-01-17 sprint to Development 2024-01-31 sprint
Actions #23

Updated by Peter Amstutz 12 months ago

  • Target version changed from Development 2024-01-31 sprint to Development 2024-02-14 sprint
  • Assigned To deleted (Alex Coleman)
Actions #24

Updated by Peter Amstutz 12 months ago

  • Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Actions #25

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Actions #27

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Actions #28

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Actions #29

Updated by Peter Amstutz 10 months ago

  • Release set to 70
Actions #30

Updated by Peter Amstutz 10 months ago

  • Assigned To set to Tom Clegg
Actions #31

Updated by Tom Clegg 10 months ago

It would certainly be easier to write the code -- and possibly easier to understand/use? -- if "include dir1 and its contents" is spelled "dir1/**" rather than "dir1". There is a doublestar glob matching package that works for us out of the box if we do this.

Actions #32

Updated by Peter Amstutz 10 months ago

Tom Clegg wrote in #note-31:

It would certainly be easier to write the code -- and possibly easier to understand/use? -- if "include dir1 and its contents" is spelled "dir1/**" rather than "dir1". There is a doublestar glob matching package that works for us out of the box if we do this.

The client doesn't know ahead of time if a pattern refers to a directory or not. The client of this is CWL which specifies POSIX glob matching which only has '?', character classes, and single star, not double star.

That said, since it provides multiple patterns and it only has to match any one pattern, we could double up every pattern as both "foo" and "foo/**", providing that patterns like "f*/**" do the right thing.

For reference this is how arvados-cwl-runner does it, the load bearing part is Python fnmatch() which breaks it into path segments and compares each segment against the glob pattern for that segment:

https://git.arvados.org/arvados.git/blob/HEAD:/sdk/cwl/arvados_cwl/fsaccess.py#l109

If Go doesn't have a convenient glob match function, I believe posix glob patterns can be fairly trivially converted into regular expressions.

Actions #33

Updated by Tom Clegg 10 months ago

  • Status changed from New to In Progress

POSIX matching is even easier (stdlib), but globs only match an individual entry, so without ** POSIX doesn't have a way to spell "entire directory tree whose root matches P".

As you say, if you want "all files that match P, and all files below all directories that match P" then you can say ["P", "P/**"].

Even though the main consumer of this API is cwltool, which would always say ["P", "P/**"] to get its specified "file or directory" behavior, it seems a bit weird to make that more convenient at the expense of making it impossible to spell the POSIX thing "only files that match P".

Actions #35

Updated by Tom Clegg 10 months ago

12430-output-glob @ c2ab17464100b4914b65473175ff1c85b102f1e3 -- developer-run-tests: #4131

  • All agreed upon points are implemented / addressed.
    • ** and [abc] and [^abc] also work (easier than not matching them -- is this ok?)
    • directory path itself does not match contents -- but ["directorypath", "directorypath/**"] does if that's what you want.
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • n/a
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • automated tests pass. no manual testing.
  • Documentation has been updated.
  • Behaves appropriately at the intended scale (describe intended scale).
    • adds per-file overhead, but I think it will be negligible compared to the work involved in copying a file, unless the number of globs given is unreasonable.
  • Considered backwards and forwards compatibility issues between client and server.
    • existing container/container_request records are backfilled with output_glob=[] during migration, so container reuse will continue to match old containers with new CRs.
    • until all of client/rails/controller/a-d-c/crunch-run are upgraded, the old behavior (output all files) will continue.
  • Follows our coding standards and GUI style guidelines.

I went with output_glob as written above, but I'm still wondering whether it should be output_globs, being an array.

Actions #36

Updated by Tom Clegg 10 months ago

  • Target version changed from Development 2024-04-10 sprint to Development 2024-04-24 sprint
Actions #37

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2024-04-24 sprint to Development 2024-05-08 sprint
Actions #38

Updated by Peter Amstutz 9 months ago

The output_glob description is pretty vague. Just as a matter of adequate API documentation, our documentation should probably include a copy of the "Patterns" documentation from doublestar:

https://github.com/bmatcuk/doublestar?tab=readme-ov-file#patterns

On that note, that section says: A doublestar (**) should appear surrounded by path separators such as /**/. This makes wonder if matching patterns as either files or directories requires ["foo", "foo/*", "foo/**/*"] ? Perhaps a few more test cases would be beneficial?

I see TestApplyGlobsToCollectionFS has a trailing /** test but TestApplyGlobsToFilesAndDirs has fewer, different test cases, which don't include trailing /** ? Shouldn't these have the same test cases?

It would be helpful to have a comment explaining that "applyGlobsToFilesAndDirs" filters out files that were found on disk before they are uploaded, whereas "applyGlobsToCollectionFS" removes files from the output collection when it has been pre-populated.

In copier.walkMount:

    case !srcMount.Writable:
        cp.logger.Printf("copying %q from %v/%v", outputRelPath, srcMount.PortableDataHash, strings.TrimPrefix(srcRelPath, "./"))
        mft, err := cp.getManifest(srcMount.PortableDataHash)
        if err != nil {
            return err
        }
        cp.manifest += mft.Extract(srcRelPath, dest).Text

One of the performance issues we are trying to fix is a situation where the tool has staged 4000 files in 4000 different collections in the working directory. It turns out that even though it is just doing a manifest text copy, loading 4000 collections one by one, at 1 second each, still takes over an hour. This implementation of output_glob would end up doing all that work only to immediately throw it out.

    output_glob text DEFAULT '[]'::text

Is there a reason not to use ::json or ::jsonb here?

Actions #39

Updated by Tom Clegg 9 months ago

Peter Amstutz wrote in #note-38:

The output_glob description is pretty vague. Just as a matter of adequate API documentation, our documentation should probably include a copy of the "Patterns" documentation from doublestar:

https://github.com/bmatcuk/doublestar?tab=readme-ov-file#patterns

Added more detail & examples.

On that note, that section says: A doublestar (**) should appear surrounded by path separators such as /**/. This makes wonder if matching patterns as either files or directories requires ["foo", "foo/*", "foo/**/*"] ? Perhaps a few more test cases would be beneficial?

No, matching either a file or an entire subtree requires just ["foo", "foo/**"].

Added a bunch more test cases, including unit tests that more explicitly check file vs. directory handling.

I see TestApplyGlobsToCollectionFS has a trailing /** test but TestApplyGlobsToFilesAndDirs has fewer, different test cases, which don't include trailing /** ? Shouldn't these have the same test cases?

The idea of TestApplyGlobsToFilesAndDirs is to test that files and dirs get removed from the list according to whether they [have descendants that] match globs, not to re-check which glob patterns match which paths. Added a comment about this.

It would be helpful to have a comment explaining that "applyGlobsToFilesAndDirs" filters out files that were found on disk before they are uploaded, whereas "applyGlobsToCollectionFS" removes files from the output collection when it has been pre-populated.

Added.

One of the performance issues we are trying to fix is a situation where the tool has staged 4000 files in 4000 different collections in the working directory. It turns out that even though it is just doing a manifest text copy, loading 4000 collections one by one, at 1 second each, still takes over an hour. This implementation of output_glob would end up doing all that work only to immediately throw it out.

Updated to skip loading collections (and walking subtree mounts) in cases where contents couldn't match globs.

Is there a reason not to use ::json or ::jsonb here?

Mainly just consistency with the other serialized columns.

12430-output-glob @ c1eab853fcda31c546b36d66ca7272ac48ac0756 -- developer-run-tests: #4195

Actions #40

Updated by Peter Amstutz 9 months ago

Tom Clegg wrote in #note-39:

Peter Amstutz wrote in #note-38:

The output_glob description is pretty vague. Just as a matter of adequate API documentation, our documentation should probably include a copy of the "Patterns" documentation from doublestar:

https://github.com/bmatcuk/doublestar?tab=readme-ov-file#patterns

Added more detail & examples.

On that note, that section says: A doublestar (**) should appear surrounded by path separators such as /**/. This makes wonder if matching patterns as either files or directories requires ["foo", "foo/*", "foo/**/*"] ? Perhaps a few more test cases would be beneficial?

No, matching either a file or an entire subtree requires just ["foo", "foo/**"].

Added a bunch more test cases, including unit tests that more explicitly check file vs. directory handling.

I see TestApplyGlobsToCollectionFS has a trailing /** test but TestApplyGlobsToFilesAndDirs has fewer, different test cases, which don't include trailing /** ? Shouldn't these have the same test cases?

The idea of TestApplyGlobsToFilesAndDirs is to test that files and dirs get removed from the list according to whether they [have descendants that] match globs, not to re-check which glob patterns match which paths. Added a comment about this.

It would be helpful to have a comment explaining that "applyGlobsToFilesAndDirs" filters out files that were found on disk before they are uploaded, whereas "applyGlobsToCollectionFS" removes files from the output collection when it has been pre-populated.

Added.

One of the performance issues we are trying to fix is a situation where the tool has staged 4000 files in 4000 different collections in the working directory. It turns out that even though it is just doing a manifest text copy, loading 4000 collections one by one, at 1 second each, still takes over an hour. This implementation of output_glob would end up doing all that work only to immediately throw it out.

Updated to skip loading collections (and walking subtree mounts) in cases where contents couldn't match globs.

Is there a reason not to use ::json or ::jsonb here?

Mainly just consistency with the other serialized columns.

12430-output-glob @ c1eab853fcda31c546b36d66ca7272ac48ac0756 -- developer-run-tests: #4195

LGTM. Will probably do some additional integration testing when implementing #9964 but this is ready to merge in order to unblock #9964

Actions #41

Updated by Tom Clegg 9 months ago

  • Status changed from In Progress to Resolved
Actions #42

Updated by Tom Clegg 6 days ago

  • Related to Bug #22467: Remove unused fulltext index added
Actions

Also available in: Atom PDF