Bug #13636

open

crunch-run takes a very long time for CWL steps with large numbers of File inputs - could use a new kind of mounts entry to address this

Added by Joshua Randall almost 6 years ago. Updated 2 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: API
Target version:
Story points: -
Release:
Release relationship: Auto

Description

crunch-run appears to be able to process only about 10 collection mounts per second. Most of this time is spent performing the os.Stat() calls serially in https://github.com/curoverse/arvados/blob/master/services/crunch-run/crunchrun.go#L597-L601.

This is fine when there are only a few inputs, but for a CWL step with a large number of input Files, each of which translates into a collection mount, setting up the mounts can take a very long time. We have one step with over 58000 inputs, and it does not even need to access any of those file inputs: it simply transposes a matrix of files without reading any of their contents.

I am guessing that the os.Stat() call is required so that the arv-mount `by_id` directory gets populated with entries that can then be bind mounted into the docker container (not sure yet if the docker container is actually going to work with 58000 inputs, but more on that later).
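To make the cost concrete, here is a minimal Python sketch of the serial pattern described above, together with the arithmetic for the 58000-input step. It is not the crunch-run code (which is Go): the function names `stat_mounts_serially` and `estimated_setup_seconds` are hypothetical, and the 10 mounts per second figure is simply the rate observed above.

```python
import os
import time


def stat_mounts_serially(by_id_root, portable_data_hashes):
    """Stat each collection directory under by_id_root, one at a time.

    Returns the elapsed wall-clock time in seconds. On a FUSE mount the
    lookup is presumably what makes the entry appear (an assumption taken
    from the guess above); on an ordinary directory it is a plain stat.
    """
    start = time.monotonic()
    for pdh in portable_data_hashes:
        os.stat(os.path.join(by_id_root, pdh))
    return time.monotonic() - start


def estimated_setup_seconds(n_mounts, mounts_per_second=10):
    """Estimate mount setup time at a fixed, serial per-mount rate."""
    return n_mounts / mounts_per_second


# 58000 mounts at about 10 per second is 5800 seconds,
# i.e. a little over an hour and a half spent only on mount setup.
print(estimated_setup_seconds(58000))  # 5800.0
```

Because the loop is serial, the total time grows linearly with the number of File inputs, which is why a step with tens of thousands of inputs is affected so badly.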

I believe it would be better if there were some mechanism by which CWL could specify mounts such that crunch-run could provide arv-mount with a (potentially long) list of Collections/Files that should be made accessible under a particular mount point, and then simply mount all of them under that mount point rather than bind mounting every input separately.

One way to do this might be to add a new kind of "mounts" entry, perhaps called "collections". arvados-cwl-runner (a-c-r) could use it by creating a mounts structure something like this:

"mounts": {
  "/keep": {
    "kind": "collections",
    "entries": {
      "00018e1f9c6158f4c075b3d8c6a9e937+270": {
        "15253243.HXV2.paired308.1a20b18880.capmq_filtered_interval_list.interval_list.171_of_200.g.vcf.gz": {"kind": "collection", "portable_data_hash": "00018e1f9c6158f4c075b3d8c6a9e937+270", "path": "15253243.HXV2.paired308.1a20b18880.capmq_filtered_interval_list.interval_list.171_of_200.g.vcf.gz"},
        "15253243.HXV2.paired308.1a20b18880.capmq_filtered_interval_list.interval_list.171_of_200.g.vcf.gz.tbi": {"kind": "collection", "portable_data_hash": "00018e1f9c6158f4c075b3d8c6a9e937+270", "path": "15253243.HXV2.paired308.1a20b18880.capmq_filtered_interval_list.interval_list.171_of_200.g.vcf.gz.tbi"}
      },
      "00057905b21d39857138519de16eb699+310": {
        "15399452.HXV2.paired308.d78fa4a102.capmq_filtered_interval_list.interval_list.173_of_200.g.vcf.gz": {"kind": "collection", "portable_data_hash": "00057905b21d39857138519de16eb699+310", "path": "15399452.HXV2.paired308.d78fa4a102.capmq_filtered_interval_list.interval_list.173_of_200.g.vcf.gz"},
        "15399452.HXV2.paired308.d78fa4a102.capmq_filtered_interval_list.interval_list.173_of_200.g.vcf.gz.tbi": {"kind": "collection", "portable_data_hash": "00057905b21d39857138519de16eb699+310", "path": "15399452.HXV2.paired308.d78fa4a102.capmq_filtered_interval_list.interval_list.173_of_200.g.vcf.gz.tbi"}
      }
    }
  }
}

The semantics could be that the specified mount ("/keep" in the above example) works in much the same way as the normal "by_id" arv-mount directory, except that the view is limited to the entries listed in the "entries" hash. The keys of that hash would be the names of virtual path prefixes below the mount point, and each value would be a hash of what basically amounts to the equivalent of "collection" mounts entries.
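As a rough illustration of the proposal, the following Python sketch shows how a client such as a-c-r might assemble one "collections" mount from a list of File inputs, and how a path below the mount point could be resolved against the "entries" hash. The "collections" kind does not exist today, and the helper names `build_collections_mount` and `resolve_entry`, as well as the shortened file names, are hypothetical.

```python
def build_collections_mount(file_inputs):
    """Group (portable_data_hash, path) pairs into one "collections" mount.

    The outer keys of "entries" are the virtual path prefixes (here the
    portable data hash); the inner keys are names below that prefix.
    """
    entries = {}
    for pdh, path in file_inputs:
        entries.setdefault(pdh, {})[path] = {
            "kind": "collection",
            "portable_data_hash": pdh,
            "path": path,
        }
    return {"kind": "collections", "entries": entries}


def resolve_entry(mount, relative_path):
    """Map "<prefix>/<name>" below the mount point to its entry, or None."""
    prefix, _, name = relative_path.partition("/")
    return mount["entries"].get(prefix, {}).get(name)


# Hypothetical inputs with shortened file names.
inputs = [
    ("00018e1f9c6158f4c075b3d8c6a9e937+270", "sample.g.vcf.gz"),
    ("00018e1f9c6158f4c075b3d8c6a9e937+270", "sample.g.vcf.gz.tbi"),
    ("00057905b21d39857138519de16eb699+310", "other.g.vcf.gz"),
]
mounts = {"/keep": build_collections_mount(inputs)}
entry = resolve_entry(
    mounts["/keep"], "00018e1f9c6158f4c075b3d8c6a9e937+270/sample.g.vcf.gz"
)
```

With this shape, the number of top-level mounts stays at one regardless of how many File inputs the step has, and anything not listed in "entries" simply does not resolve.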
