Bug #14327

[CWL] Don't make any create/list API calls when preprocessed inputs turn out to be identical to original inputs

Added by Tom Clegg 2 months ago. Updated 2 days ago.

Status:
In Progress
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

(Background: arvados-cwl-runner improves container reuse by providing input collections that contain only the files that will actually be readable in the container.)

If all files in an input collection are going to be copied, arvados-cwl-runner should use the existing collection as is, instead of creating a new collection with identical content.

In a large workflow whose inputs are already optimal, creating these redundant collections can waste a significant amount of time.


Subtasks

Task #14485: Review (New) - Peter Amstutz

History

#1 Updated by Peter Amstutz 2 months ago

Need more information about the circumstances under which this is happening, because in most cases it should already be optimizing for this (it only saves the collection if there isn't already a collection with the same PDH).

Perhaps the issue is actually talking about redundantly creating Python Collection objects, rather than creating collection records?

#2 Updated by Tom Clegg about 2 months ago

Colin provided a patched version of pathmapper.py. It seems to be based on 57fcaf45fe40806dca5f001aabed9b413243183e.

diff --git a/sdk/cwl/arvados_cwl/pathmapper.py b/sdk/cwl/arvados_cwl/pathmapper.py
index 27e48f1f4..0f0968b6f 100644
--- a/sdk/cwl/arvados_cwl/pathmapper.py
+++ b/sdk/cwl/arvados_cwl/pathmapper.py
@@ -159,12 +161,41 @@ class ArvPathMapper(PathMapper):

                 ab = self.collection_pattern % c.portable_data_hash()
                 self._pathmap[srcobj["location"]] = MapperEnt("keep:"+c.portable_data_hash(), ab, "Directory", True)
+
             elif srcobj["class"] == "File" and (srcobj.get("secondaryFiles") or
                 (srcobj["location"].startswith("_:") and "contents" in srcobj)):

+                src = srcobj["location"]
+                if isinstance(src, basestring) and ArvPathMapper.pdh_dirpath.match(src):
+                    # if all secondaryFiles are in the same source collection, a new collection is not needed
+                    main_me = self._pathmap.get(src)
+                    if main_me is None:
+                        logger.warn("pathmapper.setup: main file src=%s not found in pathmap" % src)
+                    else:
+                        main_dirname = os.path.dirname(main_me.target)
+                        all_same = True
+                        for l in srcobj.get("secondaryFiles", []):
+                            secondary_src = l["location"]
+                            if isinstance(secondary_src, basestring) and ArvPathMapper.pdh_dirpath.match(secondary_src):
+                                me = self._pathmap.get(secondary_src)
+                                if me is None:
+                                    logger.warn("pathmapper.setup: secondary src=%s not found in pathmap" % secondary_src)
+                                    all_same = False
+                                    break
+                                if main_dirname != os.path.dirname(me.target):
+                                    all_same = False
+                                    break
+                            else:
+                                all_same = False
+                                break
+
+                        if all_same:
+                            continue
+
                 c = arvados.collection.Collection(api_client=self.arvrunner.api,
                                                   keep_client=self.arvrunner.keep_client,
-                                                  num_retries=self.arvrunner.num_retries                                                  )
+                                                  num_retries=self.arvrunner.num_retries)
+
                 self.addentry(srcobj, c, ".", remap)

                 check = self.arvrunner.api.collections().list(filters=[["portable_data_hash", "=", c.portable_data_hash()]], limit=1).execute(num_retries=self.arvrunner.num_retries)
@@ -184,6 +215,8 @@ class ArvPathMapper(PathMapper):
                     if sub.startswith("./"):
                         ab = self.file_pattern % (c.portable_data_hash(), sub[2:])
                     else:
+                        # this code is pointless?
+                        logger.warn("pathmapper.setup: remap sub '%s' did not start with ./ for loc '%s'" % (sub, loc))
                         ab = self.file_pattern % (c.portable_data_hash(), sub)
                     self._pathmap[loc] = MapperEnt("keep:%s/%s" % (c.portable_data_hash(), sub[2:]),
                                                    ab, "Directory", True)

#3 Updated by Peter Amstutz about 2 months ago

Based on what the patch is trying to accomplish, I suspect one of two things is happening:

  1. The file (and its secondary files) are actually being pulled out of a larger collection, or need to be moved/renamed between the source collection and the destination collection. If this is the case, creating a new collection is intentional and may be necessary to ensure container reuse. For the sake of argument I'm going to assume this isn't actually what is happening.
  2. The source collection manifest is abnormal in some way, so that the copying behavior results in different manifest text (with a different PDH) even though from the user's perspective the content is the same. Specifically, I don't know whether crunch-run produces normalized manifests.

#4 Updated by Tom Clegg about 2 months ago

It's possible the time being saved here is really only in the collections().list() call.

We could compare the PDH of the new collection to the PDH of the original collection, and skip the (comparatively slow) collections().list() check if they're equal.
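A minimal sketch of that shortcut, assuming the original input's location is a standard "keep:<pdh>/..." string (the helper names here are illustrative, not actual arvados-cwl-runner code):

```python
import re

# Hypothetical helper: extract the portable data hash from a "keep:" location
# such as "keep:99999999999999999999999999999998+99/foo.txt".
PDH_RE = re.compile(r"^keep:([0-9a-f]{32}\+\d+)")

def original_pdh(location):
    m = PDH_RE.match(location)
    return m.group(1) if m else None

def needs_new_collection_check(location, new_pdh):
    # If the preprocessed collection's PDH equals the original input's PDH,
    # the content is byte-identical and the (comparatively slow)
    # collections().list() call can be skipped entirely.
    return original_pdh(location) != new_pdh
```

With `pdh = "9" * 32 + "+99"`, `needs_new_collection_check("keep:%s/foo.txt" % pdh, pdh)` is False (skip the API call), while any mismatched or non-keep location returns True.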

#5 Updated by Tom Clegg about 2 months ago

pathmapper.py has changed significantly since the version this patch was based on. It looks like "check for existing collection with same PDH" was removed entirely during #13330.

From #13330#note-5:

Remove the existing behavior of searching for and reusing existing collections. It should always create a new collection with the correct name, trash_at time, and properties.

I'm not sure what the rationale was, but it seems like it conflicts with the desire here to have good performance with many inputs.

#6 Updated by Peter Amstutz about 2 months ago

Tom Clegg wrote:

pathmapper.py has changed significantly since the version this patch was based on. It looks like "check for existing collection with same PDH" was removed entirely during #13330.

From #13330#note-5:

Remove the existing behavior of searching for and reusing existing collections. It should always create a new collection with the correct name, trash_at time, and properties.

I'm not sure what the rationale was, but it seems like it conflicts with the desire here to have good performance with many inputs.

The rationale was to have name and properties that reflected the particular workflow run/container that created each collection, which is incompatible with reusing collection records. Similarly, it was to avoid reusing a collection with a trash_at time that meant it could disappear sooner than we wanted it to.

#7 Updated by Tom Clegg about 1 month ago

Peter Amstutz wrote:

The rationale was to have name and properties that reflected the particular workflow run/container that created each collection, which is incompatible with reusing collection records. Similarly, it was to avoid reusing a collection with a trash_at time that meant it could disappear sooner than we wanted it to.

It seems like the reasons for wanting name/properties to match the workflow/container are (a) to avoid having a proliferation of transient collections whose origin is unknown, and (b) to avoid claiming "this is an input for container A" when in fact it's also an input for containers B, C, and D.

In that case "just use the actual input instead of making a copy, if the copy would be identical to the original input" should still fit in:
  • there's no need to explain where a thing came from if the thing doesn't even exist
  • if the original input collection's name/properties do make any claims about being an input for some other container, that's not our problem -- and in any case, making a new copy of it with different name/properties wouldn't improve anything.
"Make a copy in case the original gets deleted/trashed/modified" is still potentially interesting. This arvados-cwl-runner behavior should probably be configurable, because both strategies seem useful in different settings:
  • Make copies of everything before starting any containers. Things still work if someone deletes/modifies the inputs while the workflow is running.
  • Don't make copies. I already have a strategy to prevent the inputs from disappearing (e.g., collection versioning) or it isn't a big risk for me. I just want to get my results sooner.

#8 Updated by Tom Morris about 1 month ago

  • Assigned To set to Lucas Di Pentima
  • Target version set to 2018-11-28 Sprint

#9 Updated by Lucas Di Pentima about 1 month ago

  • Status changed from New to In Progress

#10 Updated by Lucas Di Pentima 18 days ago

  • Target version changed from 2018-11-28 Sprint to 2018-12-12 Sprint

#11 Updated by Tom Clegg 12 days ago

  • Subject changed from [CWL] Don't make identical copies of input collections to [CWL] Don't make any create/list API calls when preprocessed inputs turn out to be identical to original inputs

#12 Updated by Lucas Di Pentima 4 days ago

  • Target version changed from 2018-12-12 Sprint to 2018-12-21 Sprint

#13 Updated by Peter Amstutz 2 days ago

Background

In CWL, a File can have secondary files. These are files which are staged to the same directory as the primary file. Secondary files can include Directory objects with listings or other files with secondary files.

It is legal in CWL for a secondary file to be in a different location (collection) from the primary; what matters is what appears on the file system when the command runs. However, in Arvados we implement file staging by mounting collections as directories, so to ensure that secondary files are staged in the same directory alongside the primary, we copy all the files from their source collection(s) into a new collection.
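For illustration, a File object in this situation might look like the following (Python-dict form of the CWL object; the hashes are made up):

```python
# A primary File whose index lives in a different collection than the
# primary file itself. This is legal in CWL, but it forces
# arvados-cwl-runner to copy both files into one new collection so they
# can be staged side by side.
primary = {
    "class": "File",
    "location": "keep:" + "a" * 32 + "+100/sample.bam",
    "secondaryFiles": [
        {
            "class": "File",
            # Different source collection than the primary .bam
            "location": "keep:" + "b" * 32 + "+50/sample.bam.bai",
        },
    ],
}
```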

This also affects reproducibility and container reuse: if the source collection has other unrelated files, they won't be copied to the new collection.

This code starts at pathmapper.py:L169 and uses addentry() to copy 'srcobj' into collection 'c':

            elif srcobj["class"] == "File" and (srcobj.get("secondaryFiles") or
                (srcobj["location"].startswith("_:") and "contents" in srcobj)):

                c = arvados.collection.Collection(api_client=self.arvrunner.api,
                                                  keep_client=self.arvrunner.keep_client,
                                                  num_retries=self.arvrunner.num_retries)
                self.addentry(srcobj, c, ".", remap)

                container = arvados_cwl.util.get_current_container(self.arvrunner.api, self.arvrunner.num_retries, logger)
                info = arvados_cwl.util.get_intermediate_collection_info(None, container, self.arvrunner.intermediate_output_ttl)

                c.save_new(name=info["name"],
                           owner_uuid=self.arvrunner.project_uuid,
                           ensure_unique_name=True,
                           trash_at=info["trash_at"],
                           properties=info["properties"])

The problem

The problem seems to be that if you have a large number of input Files that include secondary files, it will do a lot of work creating new collections. That work is often redundant: the input collections already have the primary and secondary files in the right place, so the original collection could be passed through.

A previous version of this code would create the new collection but not save it; instead, it would first check whether a collection with the same portable_data_hash already existed. This is a somewhat high-overhead solution (it involves creating the Collection and making an API call), and if the manifests don't match exactly it still means creating a new collection. The current version of this code does not even check whether the PDH already exists: it always creates a new collection, in order to supply the correct intermediate_collection_info properties for the container.

The code snippet from Colin indicates they solved the bottleneck by adding a check that the primary file and all the secondary files have the same target directory. Their solution is incomplete (it does not handle the nested secondary files / Directories mentioned above), but it avoids creating redundant new collections.

Referencing files from a larger collection has implications for reproducibility and container reuse. If the source collection has other unrelated files, they will be visible to the container even though it didn't ask for them (which allows a workflow to rely on a file being in the input that hasn't been explicitly declared as a secondary file). Another problem is that unrelated files in the same collection could change, which would change the PDH and prevent container reuse, even though the actual input files used by the container did not change. (However, we currently don't copy individual files to separate collections either, even though they are equally affected by unrelated changes in the same collection, so as a policy we may just have to say this is okay.)

Solution

  1. Get the leading path to the primary file (the value of 'location' up to the last slash).
  2. Recursively traverse secondaryFiles / Directory listings and check the leading path of each 'location'.
  3. If everything matches, there is no need to create a Collection; 'continue' the `for srcobj in referenced_files` loop.
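The steps above can be sketched as follows. This is a hypothetical illustration, not actual arvados-cwl-runner code: `srcobj` is a CWL File object as in the snippets earlier in this note, and `leading_path` / `can_reuse_original` are invented names.

```python
def leading_path(location):
    # The value of 'location' up to the last slash, e.g.
    # "keep:<pdh>/dir/file.txt" -> "keep:<pdh>/dir"
    return location.rsplit("/", 1)[0]

def can_reuse_original(srcobj):
    """Return True when the primary file and all of its (possibly nested)
    secondaryFiles / Directory listings already sit in the right place in
    one source collection, so no new collection is needed."""
    def check(obj, base):
        if leading_path(obj["location"]) != base:
            return False
        # Secondary files stage alongside their primary (same base);
        # listing entries live inside their parent Directory.
        return (all(check(sf, base) for sf in obj.get("secondaryFiles", []))
                and all(check(e, obj["location"]) for e in obj.get("listing", [])))
    return check(srcobj, leading_path(srcobj["location"]))
```

When `can_reuse_original(srcobj)` is True, the loop can `continue` without constructing a Collection or making any create/list API calls, which is the point of this ticket.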
