Bug #13636

crunch-run takes a very long time for CWL steps with large numbers of File inputs - could use a new kind of mounts entry to address this

Added by Joshua Randall 6 months ago. Updated 3 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
API
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

crunch-run appears to be able to process about 10 collection mounts per second (most of this time is spent performing the os.Stat() calls serially in https://github.com/curoverse/arvados/blob/master/services/crunch-run/crunchrun.go#L597-L601).

This is fine when there are only a few inputs, but for a CWL step with a large number of input Files, each of which translates into a collection mount, this can mean a very long time. We have one step with over 58000 inputs (which actually does not even need to access any of those file inputs - it is simply a step that processes that transposes a matrix of files without actually accessing any of their contents).

I am guessing that the os.Stat() call is required so that the arv-mount `by_id` directory gets populated with entries that can then be bind mounted into the docker container (not sure yet if the docker container is actually going to work with 58000 inputs, but more on that later).

I believe it would be better if there was some mechanism by which CWL could specify mounts such that crunch-run could provide arv-mount with a (potentially long) list of Collections/Files that should be made accessible under a particular mount point, and then just simply mount all of them under that mount point rather than bind mounting every input separately.

One way to do this might be to add a new kind of "mounts" entry that could perhaps be called "collections". a-c-r could use it by creating a mounts structure something like this:

"mounts": {
  "/keep": {
    "kind": "collections",
    "entries": {
      "00018e1f9c6158f4c075b3d8c6a9e937+270": {
        "15253243.HXV2.paired308.1a20b18880.capmq_filtered_interval_list.interval_list.171_of_200.g.vcf.gz": {"kind": "collection", "portable_data_hash": "00018e1f9c6158f4c075b3d8c6a9e937+270", "path": "15253243.HXV2.paired308.1a20b18880.capmq_filtered_interval_list.interval_list.171_of_200.g.vcf.gz"},
        "15253243.HXV2.paired308.1a20b18880.capmq_filtered_interval_list.interval_list.171_of_200.g.vcf.gz.tbi": {"kind": "collection", "portable_data_hash": "00018e1f9c6158f4c075b3d8c6a9e937+270", "path": "15253243.HXV2.paired308.1a20b18880.capmq_filtered_interval_list.interval_list.171_of_200.g.vcf.gz.tbi"}
      },
      "00057905b21d39857138519de16eb699+310": {
        "15399452.HXV2.paired308.d78fa4a102.capmq_filtered_interval_list.interval_list.173_of_200.g.vcf.gz": {"kind": "collection", "portable_data_hash": "00057905b21d39857138519de16eb699+310", "path": "15399452.HXV2.paired308.d78fa4a102.capmq_filtered_interval_list.interval_list.173_of_200.g.vcf.gz"},
        "15399452.HXV2.paired308.d78fa4a102.capmq_filtered_interval_list.interval_list.173_of_200.g.vcf.gz.tbi": {"kind": "collection", "portable_data_hash": "00057905b21d39857138519de16eb699+310", "path": "15399452.HXV2.paired308.d78fa4a102.capmq_filtered_interval_list.interval_list.173_of_200.g.vcf.gz.tbi"}
      }
    ]
  }
}

The semantics could be that the specified mount (in the above example "/keep") could work in much the same way as the normal "by_id" arv-mount directory, although the view would be limited to only those entries listed in the "entries" hash (whose keys would be the names of virtual path prefixes below the mount point and whose value would be a hash of what basically amount to the equivalent of "collection" mounts entries.

History

#1 Updated by Joshua Randall 6 months ago

Update: actually, the amount of time it takes for crunch-run to prepare 58000 mounts appears to be quite a bit longer - after 3 hours it has gotten to 28000 and is adding additional files at a rate of around 1 per second, so the instantaneous estimate to complete the remaining 30000 is over 8 hours. Not sure why it is slowing down...

#2 Updated by Joshua Randall 6 months ago

On further investigation, it looks like this many mounts causes the docker daemon to fail - they are just stuck in "Created" state forever:

# grep bd441dba528a /var/log/syslog
Jun 18 18:29:53 arvados-compute-node-eglyx-006 dockerd[1254]: time="2018-06-18T18:29:53.583014105Z" level=error msg="bd441dba528a93597cadb2461d99bbc7aa67ae14320b45dd2a5da5bdb5e221be cleanup: failed to delete container from containerd: no such container" 
Jun 18 18:29:53 arvados-compute-node-eglyx-006 dockerd[1254]: time="2018-06-18T18:29:53.583161488Z" level=error msg="Handler for POST /v1.21/containers/bd441dba528a93597cadb2461d99bbc7aa67ae14320b45dd2a5da5bdb5e221be/start returned error: grpc: received message larger than max (15190409 vs. 4194304): unknown" 
# docker ps -a |grep bd441dba528a
bd441dba528a        f9185748efcb        "python -c 'import j…"   4 hours ago         Created                                 eglyx-dz642-pfcmic2v69ymg4m

We have huge numbers of these piling up on our compute nodes:

# docker ps -a -f status=created | wc -l
281

#3 Updated by Tom Morris 6 months ago

I suspect the root cause of this is the poor performance of the collection handling code, which the FUSE driver depends on, and the FUSE driver itself.

If you're not actually accessing the file contents, but instead just the names, perhaps you can pass them as strings instead of Files to sidestep the whole issue.

If you need to reference a large number of collections (even if it's just to validate that they contain the necessary files), you are probably in trouble regardless of how you organize the work because all of those collections manifests will need to be fully parsed, turned into ArvFile objects, cached, flushed from the cache, etc.

Without making the underlying collection handling code faster/smarter, there may not be a satisfactory solution to this.

#4 Updated by Tom Clegg 5 months ago

If crunch-run is spending a lot of time doing serial calls to Stat(), it might help to stop doing them serially.

The error message suggests the purpose of the Stat() calls is to confirm that the files exist. This could be done without hitting arv-mount at all, by loading the collections with native Go code in crunch-run. It wouldn't have the (presumably desirable) side effect of preloading arv-mount's cache, but it would be much faster.

Either way, I think it's worth adding a way to tell crunch-run that it's OK to mount "/a/b", "/a/c", and "/a/d" by bind-mounting "/a" to a directory that's managed by the fuse driver, rather than bind-mounting b, c, and d individually -- IOW, "/a" can be empty and non-writable regardless of what the docker image has there.

Rather than a "multiple collections" mount or otherwise extending the "mounts" structure to express a tree of mounts, though, I think it would be better to just specify that "/a" is empty and non-writable, and let crunch-run take advantage of this to provide the desired mounts more efficiently.
  • avoids the possibility of ambiguous mount structures where a single path "/a/b" has two different specs, one given under the multi-mount rooted at "/a", and another in a mount rooted at "/a/b"
  • avoids coupling the shape of the mounts spec to [the client's opinion about] the strategy for implementing a given set of mounts

#5 Updated by Tom Clegg 5 months ago

To clarify, even if the Stat() loop is fast, docker can't handle the number of mount points we need. Some form of "one bind mount satisfies multiple container mounts" facility is definitely required.

#6 Updated by Joshua Randall 5 months ago

As Tom mentioned, in addition to performance issues, we see a more fundamental problem in that docker is unable to launch containers with too many mounts - the problem manifests itself in that the container configuration json is too large to be passed to libcontainer (grpc Max message size exceeded).

#7 Updated by Tom Morris 4 months ago

  • Target version set to To Be Groomed

#8 Updated by Peter Amstutz 4 months ago

There's arv-mount --mount-by-id but that only lets you rename the logical by_id directory, not specify individual collections.

I agree with Tom Clegg that it would be useful to have a feature to pass a list of PDH or UUIDs that should be accessible through the mount point
(probably through a file since passing 58000 command line arguments probably won't work). I don't believe that feature has been implemented. That would require some changes to arv-mount and also updating crunch-run to use it.

We have one step with over 58000 inputs (which actually does not even need to access any of those file inputs - it is simply a step that processes that transposes a matrix of files without actually accessing any of their contents).

I don't know exactly what you're doing, but have you considered alternative approaches?

  1. Use a CWL ExpressionTool to reorganize the files in javascript into the desired structure
  2. Consolidate inputs into a list of collection PDHs (strings, not File objects) and pass that to a command line tool which uses the Arvados Python SDK to copy the files to an output collection, produce an cwl.output.json referencing the output collection.

#9 Updated by Joshua Randall 3 months ago

While it's true that our steps that basically just provide CWL JSON rearrangement functionality have the largest number of inputs, we also have "real" steps which have 36k actual input files. For example, the GATK4 workflow we are currently running has 18k gVCFs and 18k gVCF indices which are the outputs from 18k HaplotypeCaller runs that are being fed into a single GenomicsDBImport step to be merged into a single GenomicsDB for subsequent GenotypeGVCFs.

This should be a totally reasonable thing to do, and with the new functionality we have implemented as a fix to this issue, the GenomicsDBImport steps are running on many inputs without issue (GenomicsDBImport opens and reads the inputs in turn rather than trying to open them all at once).

The code we are running that fix 13636 for us are:
- https://github.com/wtsi-hgi/bindmapfuse
- https://github.com/wtsi-hgi/arvados/tree/13636-crunch-run-bindmapfuse
- https://github.com/wtsi-hgi/arvados/tree/13636-a-c-r-empty-keep-mount

Note that our crunch-run changes include disabling the `stat` that it used to perform on all keep bind mounts prior to starting docker, because it is no longer required since the `bindmapfuse` mount point itself is the only thing from the keep tree that docker bind mounts.

As a result, ExpressionTool steps (such as doing a matrix transpose of an array of an array of Files; https://github.com/wtsi-hgi/arvados-pipelines/blob/master/cwl/workflows/gatk-4.0.0.0-joint-calling.cwl#L49-L53) are now running very quickly because there is little cost to having a lot of keep File inputs that are not actually accessed.

#10 Updated by Joshua Randall 3 months ago

N.B. that matrix transpose is not a real CWL ExpressionTool but rather a CommandLineTool that we use in place of an expression tool.

#11 Updated by Peter Amstutz 3 months ago

I'm glad you found a solution.

I need to study the problem a bit more, I suspect the permanent solution will be in arv-mount / crunch-run to avoid the need for bindmapfuse.

Also available in: Atom PDF