Bug #20561

closed

Log when files from input are being propagated to output in crunch-run finalization

Added by Peter Amstutz 11 months ago. Updated 11 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: 0.5
Release relationship: Auto

Description

"Maximum container memory rss usage"

then nothing for almost 2 hours, then finishes up with

"copying /file.txt (200000 bytes)"
"maximum keepstore memory rss"
...
Completed

On further investigation:

The output collection has ~4400 files. Except for the one file that was reported as being copied, it looks like these are staged to an intermediate collection, made to appear in the output directory, and then propagated to the output collection.

So it seems like it is doing something that causes it to iterate over each of the 4400 files. It only needs to take 1.5s to process each file for that to add up to nearly two hours.

The input consists of an array of 4400 files, each pulled from a different collection, so I think what is happening is that it is sequentially fetching 4400 collections with manifest text.

Things to do:

  1. Log that this is happening (print out each file being added)
  2. We don't actually need these files in the output at all. We should support a regex filter on what gets collected for the output collection, and not upload or propagate files that the user doesn't want. There's actually a really old ticket for this: #9964

Subtasks: 1 (0 open, 1 closed)

Task #20562: Review 20561-file-copy-logging (Resolved, Peter Amstutz, 05/30/2023)

Related issues

Related to Arvados - Feature #9964: arvados-cwl-runner limits output data to keep using output_glob (New, Peter Amstutz)