Bug #18489

closed

CWL: Intermittent Singularity startup failures

Added by Tom Schoonjans about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assigned To:
Category: Crunch
Target version:
Story points: -

Description

We recently started testing a simple CWL workflow in Arvados Crunch 2.3.1. The jobs are executed on a Slurm cluster consisting of 4 nodes with 4 cores each, created as OpenStack VMs.

The workflow consists of three steps:
  1. Download 23 vcf.gz files from an external, non-federated Arvados cluster using arv-get
  2. Convert these files to hap/sample using bcftools convert
  3. Upload the 23 hap/sample file pairs to the external Arvados cluster using arv-put

The first two steps use scattering to spread the jobs over the nodes.

The Crunch container runtime has been set to 'singularity'.

Running the Crunch workflow results in intermittent failures of jobs in step 1 or 2. The workflow does reach the desired result after being restarted once or twice. The failure message is always:

2021-11-26T18:57:12.918958066Z ERROR [container bcftools_convert_10] (arvc1-xvhdp-ffcjthyczz8gqol) error log:
2021-11-26T18:57:12.918958066Z 
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:45.272198441Z crunch-run Not starting a gateway server (GatewayAuthSecret was not provided by dispatcher)
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:45.272309584Z crunch-run crunch-run 2.3.1 (go1.17.1) started
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:45.272331939Z crunch-run Executing container 'arvc1-dz642-tlt1uw3hyppd06g' using singularity runtime
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:45.272354161Z crunch-run Executing on host 'slurm-worker-blue-4'
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:45.433328942Z crunch-run container token "token-obfuscated" 
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:45.433798736Z crunch-run Running [arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --disable-event-listening --mount-by-id by_uuid /tmp/crunch-run.arvc1-dz642-tlt1uw3hyppd06g.1063908694/keep2883686336]
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:49.241177040Z crunch-run Fetching Docker image from collection '3126bcd60bc91b04916bae8dfadede7d+177'
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:49.372828520Z crunch-run Using Docker image id "sha256:c74eb4247d8550fa2df0c5275a9dd3b34cb105347cc0fcefdf5a05749faaf0a1" 
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:49.372958779Z crunch-run Loading Docker image from keep
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:50.079531206Z crunch-run Starting container
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:50.080267548Z crunch-run Waiting for container to finish
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:52.939218932Z stderr FATAL:   container creation failed: mount /proc/self/fd/3->/usr/local/var/singularity/mnt/session/rootfs error: while mounting image /proc/self/fd/3: failed to find loop device: could not attach image file to loop device: failed to set loop flags on loop device: resource temporarily unavailable
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:53.365279723Z crunch-run Complete
2021-11-26T18:57:13.026434934Z ERROR [container bcftools_convert_10] unable to collect output from d41d8cd98f00b204e9800998ecf8427e+0:
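
Since the failure is intermittent and always carries the same stderr message, it can help to scan saved workflow logs for it and list which steps were affected. The following is only a minimal sketch: the function name `find_loop_device_failures` and both patterns are assumptions of this example, derived from the log lines quoted above rather than from any Arvados API.

```python
import re

# Substring copied verbatim from the stderr line in the log above.
LOOP_DEVICE_ERROR = (
    "failed to set loop flags on loop device: resource temporarily unavailable"
)

# Matches the header line that names the failed step and its UUID, e.g.
# "ERROR [container bcftools_convert_10] (arvc1-...) error log:".
# The pattern is inferred from the quoted log, not from a documented format.
CONTAINER_HEADER = re.compile(
    r"ERROR \[container (?P<step>[^\]]+)\] \((?P<uuid>[^)]+)\) error log:"
)


def find_loop_device_failures(log_text):
    """Return (step, uuid) pairs whose error log contains the loop device error."""
    failures = []
    current = None  # most recent container header seen so far
    for line in log_text.splitlines():
        header = CONTAINER_HEADER.search(line)
        if header:
            current = (header.group("step"), header.group("uuid"))
        elif current and LOOP_DEVICE_ERROR in line and current not in failures:
            failures.append(current)
    return failures
```

Calling `find_loop_device_failures(open("workflow.log").read())` on a saved log (the file name is hypothetical) would list each affected step once, which makes it easier to see whether the failures cluster in particular steps.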

I am attaching the CWL files (with some redactions) that were used during our test workflow.


Files

arv-put.cwl (1.14 KB), Tom Schoonjans, 11/29/2021 03:41 PM
arv-get.cwl (718 Bytes), Tom Schoonjans, 11/29/2021 03:41 PM
bcftools-convert-vcf2hapsample.cwl (747 Bytes), Tom Schoonjans, 11/29/2021 03:41 PM
test-bcftools.cwl (1.64 KB), Tom Schoonjans, 11/29/2021 03:41 PM

Related issues: 1 (0 open, 1 closed)

Related to Arvados - Support #18566: Document singularity loopback device conflict bug (Resolved, Tom Clegg, 12/10/2021)