Bug #14327


[CWL] Don't make any create/list API calls when preprocessed inputs turn out to be identical to original inputs

Added by Tom Clegg about 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: -
Target version: 2018-12-21 Sprint
Story points: -
Release relationship: Auto

Description

(Background: arvados-cwl-runner improves container reuse by providing input collections that contain only the files that will actually be readable in the container.)

If all files in an input collection are going to be copied, arvados-cwl-runner should use the existing collection as is, instead of creating a new collection with identical content.

In a large workflow whose inputs are already optimal, creating these redundant collections can waste a significant amount of time.


Subtasks 1 (0 open, 1 closed)

Task #14485: Review 14327-cwl-unnecessary-collections (Resolved, Lucas Di Pentima, 12/18/2018)
#1

Updated by Peter Amstutz about 6 years ago

Need more information about the circumstances in which this is happening, because in most cases it should already be optimized for this (it only saves the collection if there isn't already a collection with the same PDH).

Perhaps the issue is actually talking about redundantly creating Python Collection objects, rather than creating collection records?

#2

Updated by Tom Clegg about 6 years ago

Colin provided a patched version of pathmapper.py. It seems to be based on 57fcaf45fe40806dca5f001aabed9b413243183e.

diff --git a/sdk/cwl/arvados_cwl/pathmapper.py b/sdk/cwl/arvados_cwl/pathmapper.py
index 27e48f1f4..0f0968b6f 100644
--- a/sdk/cwl/arvados_cwl/pathmapper.py
+++ b/sdk/cwl/arvados_cwl/pathmapper.py
@@ -159,12 +161,41 @@ class ArvPathMapper(PathMapper):

                 ab = self.collection_pattern % c.portable_data_hash()
                 self._pathmap[srcobj["location"]] = MapperEnt("keep:"+c.portable_data_hash(), ab, "Directory", True)
+
             elif srcobj["class"] == "File" and (srcobj.get("secondaryFiles") or
                 (srcobj["location"].startswith("_:") and "contents" in srcobj)):

+                src = srcobj["location"]
+                if isinstance(src, basestring) and ArvPathMapper.pdh_dirpath.match(src):
+                    # if all secondaryFiles are in the same source collection, a new collection is not needed
+                    main_me = self._pathmap.get(src)
+                    if main_me is None:
+                        logger.warn("pathmapper.setup: main file src=%s not found in pathmap" % src)
+                    else:
+                        main_dirname = os.path.dirname(main_me.target)
+                        all_same = True
+                        for l in srcobj.get("secondaryFiles", []):
+                            secondary_src = l["location"]
+                            if isinstance(secondary_src, basestring) and ArvPathMapper.pdh_dirpath.match(secondary_src):
+                                me = self._pathmap.get(secondary_src)
+                                if me is None:
+                                    logger.warn("pathmapper.setup: secondary src=%s not found in pathmap" % secondary_src)
+                                    all_same = False
+                                    break
+                                if main_dirname != os.path.dirname(me.target):
+                                    all_same = False
+                                    break
+                            else:
+                                all_same = False
+                                break
+
+                        if all_same:
+                            continue
+
                 c = arvados.collection.Collection(api_client=self.arvrunner.api,
                                                   keep_client=self.arvrunner.keep_client,
-                                                  num_retries=self.arvrunner.num_retries                                                  )
+                                                  num_retries=self.arvrunner.num_retries)
+
                 self.addentry(srcobj, c, ".", remap)

                 check = self.arvrunner.api.collections().list(filters=[["portable_data_hash", "=", c.portable_data_hash()]], limit=1).execute(num_retries=self.arvrunner.num_retries)
@@ -184,6 +215,8 @@ class ArvPathMapper(PathMapper):
                     if sub.startswith("./"):
                         ab = self.file_pattern % (c.portable_data_hash(), sub[2:])
                     else:
+                        # this code is pointless?
+                        logger.warn("pathmapper.setup: remap sub '%s' did not start with ./ for loc '%s'" % (sub, loc))
                         ab = self.file_pattern % (c.portable_data_hash(), sub)
                     self._pathmap[loc] = MapperEnt("keep:%s/%s" % (c.portable_data_hash(), sub[2:]),
                                                    ab, "Directory", True)
#3

Updated by Peter Amstutz about 6 years ago

Based on what the patch is trying to accomplish, I suspect one of two things is happening:

  1. The file (and its secondary files) are actually being pulled out of a larger collection, or need to be moved/renamed between the source collection and the destination collection. If this is the case, creating a new collection is intentional and may be necessary to ensure container reuse. For the sake of argument I'm going to assume this isn't actually what is happening.
  2. The source collection manifest is abnormal in some way, so that the copying behavior results in different manifest text (with a different PDH) even though, from the user's perspective, the content is the same. Specifically, I don't know whether crunch-run produces normalized manifests.
#4

Updated by Tom Clegg about 6 years ago

It's possible the time being saved here is really only in the collections().list() call.

We could compare the PDH of the new collection to the PDH of the original collection, and skip the (comparatively slow) collections().list() check if they're equal.
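
A minimal sketch of that short-circuit (hypothetical helper names, not the shipped code; assumes the original input location is a keep: URI):

    # Sketch only: compare the rebuilt collection's PDH with the PDH embedded
    # in the original "keep:..." location, and skip both the create and the
    # list API calls when they are equal.

    def pdh_of_location(location):
        """Extract the portable data hash from a 'keep:<pdh>/<path>' URI."""
        assert location.startswith("keep:")
        return location[len("keep:"):].split("/", 1)[0]

    def is_redundant_copy(new_collection, original_location):
        """True when saving new_collection would duplicate the original input."""
        return new_collection.portable_data_hash() == pdh_of_location(original_location)

If is_redundant_copy() returns True, the original location can be used directly and no API calls are needed.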

#5

Updated by Tom Clegg about 6 years ago

pathmapper.py has changed significantly since the version this patch was based on. It looks like "check for existing collection with same PDH" was removed entirely during #13330.

From #13330#note-5:

Remove the existing behavior of searching for and reusing existing collections. It should always create a new collection with the correct name, trash_at time, and properties.

I'm not sure what the rationale was, but it seems like it conflicts with the desire here to have good performance with many inputs.

#6

Updated by Peter Amstutz about 6 years ago

Tom Clegg wrote:

pathmapper.py has changed significantly since the version this patch was based on. It looks like "check for existing collection with same PDH" was removed entirely during #13330.

From #13330#note-5:

Remove the existing behavior of searching for and reusing existing collections. It should always create a new collection with the correct name, trash_at time, and properties.

I'm not sure what the rationale was, but it seems like it conflicts with the desire here to have good performance with many inputs.

The rationale was to have name and properties that reflected the particular workflow run/container that created each collection, which is incompatible with reusing collection records. Similarly, it was to avoid reusing a collection with a trash_at time that meant it could disappear sooner than we wanted it to.

#7

Updated by Tom Clegg about 6 years ago

Peter Amstutz wrote:

The rationale was to have name and properties that reflected the particular workflow run/container that created each collection, which is incompatible with reusing collection records. Similarly, it was to avoid reusing a collection with a trash_at time that meant it could disappear sooner than we wanted it to.

It seems like the reasons for wanting name/properties to match the workflow/container are (a) to avoid having a proliferation of transient collections whose origin is unknown, and (b) to avoid claiming "this is an input for container A" when in fact it's also an input for containers B, C, and D.

In that case "just use the actual input instead of making a copy, if the copy would be identical to the original input" should still fit in:
  • there's no need to explain where a thing came from if the thing doesn't even exist
  • if the original input collection's name/properties do make any claims about being an input for some other container, that's not our problem -- and in any case, making a new copy of it with different name/properties wouldn't improve anything.
"Make a copy in case the original gets deleted/trashed/modified" is still potentially interesting. This a-c-r behavior should probably be configurable, because both strategies seem useful in different settings:
  • Make copies of everything before starting any containers. Things still work if someone deletes/modifies the inputs while the workflow is running.
  • Don't make copies. I already have a strategy to prevent the inputs from disappearing (e.g., collection versioning) or it isn't a big risk for me. I just want to get my results sooner.
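
Purely as illustration (neither flag exists in arvados-cwl-runner; the names are made up), the choice could be exposed something like this:

    # Hypothetical command-line flags, for illustration only; a-c-r does not
    # currently offer this choice.
    import argparse

    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--copy-inputs", dest="copy_inputs",
                       action="store_true", default=True,
                       help="Copy inputs into new collections before starting "
                            "any containers; the workflow survives deletion or "
                            "modification of the originals.")
    group.add_argument("--reference-inputs", dest="copy_inputs",
                       action="store_false",
                       help="Reference the original input collections directly "
                            "for faster startup; the caller keeps them alive, "
                            "e.g. via collection versioning.")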
#8

Updated by Tom Morris about 6 years ago

  • Assigned To set to Lucas Di Pentima
  • Target version set to 2018-11-28 Sprint
#9

Updated by Lucas Di Pentima about 6 years ago

  • Status changed from New to In Progress
#10

Updated by Lucas Di Pentima about 6 years ago

  • Target version changed from 2018-11-28 Sprint to 2018-12-12 Sprint
#11

Updated by Tom Clegg about 6 years ago

  • Subject changed from [CWL] Don't make identical copies of input collections to [CWL] Don't make any create/list API calls when preprocessed inputs turn out to be identical to original inputs
#12

Updated by Lucas Di Pentima about 6 years ago

  • Target version changed from 2018-12-12 Sprint to 2018-12-21 Sprint
#13

Updated by Peter Amstutz about 6 years ago

Background

In CWL, a File can have secondary files: files that are staged to the same directory as the primary file. Secondary files can include Directory objects with listings, or other Files that themselves have secondary files.

It is legal in CWL for a secondary file to be in a different location (collection) from the primary; what matters is what appears on the file system when the command runs. However, in Arvados we implement file staging by mounting collections as directories, so to ensure that secondary files are staged in the same directory alongside the primary, we copy all the files from their source collection(s) into a new collection.

This also helps reproducibility and container reuse: if the source collection has other unrelated files, they won't be copied to the new collection.
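
For concreteness, here is a hypothetical input (made-up PDH) as arvados-cwl-runner might see it; both locations share the same leading keep: path, so the primary and its index already sit in one collection:

    # Hypothetical example: the primary .bam and its .bai index live in the
    # same source collection, so staging them together needs no copy. If the
    # .bai lived in a different collection, the keep: prefixes would differ
    # and a merged collection would have to be created.
    srcobj = {
        "class": "File",
        "location": "keep:99999999999999999999999999999999+2048/sample.bam",
        "secondaryFiles": [
            {"class": "File",
             "location": "keep:99999999999999999999999999999999+2048/sample.bam.bai"},
        ],
    }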

This code starts at pathmapper.py:L169 and uses addentry() to copy 'srcobj' into collection 'c':

            elif srcobj["class"] == "File" and (srcobj.get("secondaryFiles") or
                (srcobj["location"].startswith("_:") and "contents" in srcobj)):

                c = arvados.collection.Collection(api_client=self.arvrunner.api,
                                                  keep_client=self.arvrunner.keep_client,
                                                  num_retries=self.arvrunner.num_retries)
                self.addentry(srcobj, c, ".", remap)

                container = arvados_cwl.util.get_current_container(self.arvrunner.api, self.arvrunner.num_retries, logger)
                info = arvados_cwl.util.get_intermediate_collection_info(None, container, self.arvrunner.intermediate_output_ttl)

                c.save_new(name=info["name"],
                           owner_uuid=self.arvrunner.project_uuid,
                           ensure_unique_name=True,
                           trash_at=info["trash_at"],
                           properties=info["properties"])

The problem

The problem seems to be that if you have a large number of input Files that include secondary files, it will do a lot of work creating new collections. That work is often redundant: the input collections already have the primary and secondary files in the right place, so the original collection could be passed through.

A previous version of this code would create the new collection but not save it right away; it would first check whether a collection with the same portable_data_hash already existed. This is a somewhat high-overhead solution (it involves creating the Collection and making an API call), and if the manifests don't match exactly it still ends up creating a new collection. The current version of this code does not even check whether the PDH already exists. It always creates a new collection, in order to supply the correct intermediate_collection_info properties for the container.

The code snippet from Colin indicates they solved the bottleneck by adding a check to see whether the primary and all the secondary files have the same target directory. Their solution is incomplete (it does not handle the nested secondary files / Directories mentioned above) but it avoids creating redundant new collections.

Referencing files from a larger collection has implications for reproducibility and container reuse. If the source collection has other unrelated files, they will be visible to the container even though it didn't ask for them (which allows a workflow to rely on a file in the input that hasn't been explicitly declared as a secondary file). Another problem is that unrelated files in the same collection could change, which would change the PDH and prevent container reuse even though the actual input files used by the container did not change. (However, we currently don't copy individual files to separate collections, even though they are equally affected by unrelated changes in the same collection, so as a policy maybe we just have to say this is okay.)

Solution

  1. Get the leading path to the primary file (the value of 'location' up to the last slash)
  2. Recursively traverse secondaryFiles / Directories and check the leading path on 'location'
  3. If everything matches, there's no need to create a Collection; just 'continue' the for srcobj in referenced_files loop (see the sketch below)
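
A minimal sketch of those steps (hypothetical helper, not the merged implementation; literal "_:" locations and missing fields are ignored for brevity):

    import posixpath

    def can_reuse_source_collection(srcobj):
        """Steps 1-3 above: True when the primary file and all of its
        (possibly nested) secondary files already share one source directory."""
        prefix = posixpath.dirname(srcobj["location"])  # step 1: leading path

        def ok(obj, dirname):
            # step 2: this entry must sit in the expected directory...
            if posixpath.dirname(obj.get("location", "")) != dirname:
                return False
            # ...and so must everything nested under it: a Directory's listing
            # lives inside it, secondary files sit alongside their primary.
            return (all(ok(sub, obj.get("location", ""))
                        for sub in obj.get("listing", []))
                    and all(ok(sub, dirname)
                            for sub in obj.get("secondaryFiles", [])))

        return all(ok(sf, prefix) for sf in srcobj.get("secondaryFiles", []))  # step 3

When it returns True, the loop can 'continue' without creating a Collection or making any API calls.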
#14

Updated by Peter Amstutz about 6 years ago

  • Assigned To changed from Lucas Di Pentima to Peter Amstutz
#16

Updated by Lucas Di Pentima about 6 years ago

Some small details:

  • pathmapper.py:L138: We could avoid one indentation level by using for s in srcobj.get("secondaryFiles", []):, maybe also on L143
  • I think the commit message would be a good code comment too.
  • Test comments would also be welcome, as some cases differ from each other only in very small details that aren't obvious at first glance
  • Other than that, LGTM.
#17

Updated by Peter Amstutz about 6 years ago

  • Status changed from In Progress to Resolved
#18

Updated by Tom Morris almost 6 years ago

  • Release set to 15