Idea #17755 (closed)
Test singularity support on a cloud cluster by running some real workflows
Description
On a cloud cluster with singularity enabled (requires a new compute image + config changes), run the WGS workflow with small chr19 inputs, e.g. pirca-xvhdp-m4r8n9qmoh5ggra.
Updated by Peter Amstutz over 3 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 3 years ago
- Status changed from In Progress to New
- Tracker changed from Bug to Idea
Updated by Peter Amstutz over 3 years ago
- Description updated (diff)
- Subject changed from Test singularity support on 9tee4 by running some real workflows to Test singularity support on a cloud cluster by running some real workflows
Updated by Ward Vandewege over 3 years ago
- Status changed from New to In Progress
Updated by Ward Vandewege over 3 years ago
- Related to Idea #17296: Singularity proof of concept added
Updated by Ward Vandewege over 3 years ago
I have built a new compute node image for ce8i5 that has the singularity 3.5.2 binary (an old version) in it, cf. commit 784a3f24d37819186a52ea2c67e15e5bd8639076 on branch 17755-add-singularity-to-compute-image. That commit builds singularity from source while creating the packer image, which seems suboptimal (it requires a lot of build-time dependencies, and it takes a while).
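For context, a hedged sketch of the kind of from-source build that packer step performs; package names follow the upstream Singularity 3.x build docs and are approximate, and the authoritative steps live in the branch above:

# Build-time dependencies plus the standard mconfig/make sequence; the source
# tarball is assumed to have been fetched separately.
sudo apt-get install -y build-essential libssl-dev uuid-dev libgpgme-dev \
    squashfs-tools libseccomp-dev wget pkg-config git cryptsetup golang-go
tar xzf singularity-3.5.2.tar.gz && cd singularity
./mconfig && make -C builddir && sudo make -C builddir install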
I've run the diagnostics workflow with it at ce8i5-xvhdp-el23s4fjsrp1mjb.
Observations:
- the conversion of a ~240MiB docker tar file took over 5 minutes (ouch!), about 4 of which appear to be spent in the invocation of `mksquashfs` (a manual reproduction is sketched after the timings below)
- this conversion is invisible to the user: it happens during the "loading the image" stage, so it is not reflected when (e.g.) Workbench shows workflow step durations.
- the `hasher3` step ran on the same node as the `hasher1` step, but it repeated the singularity import step. Ouch.
ce8i5-xvhdp-ia0zn3njn6lwbdj: docker: 0m9s(6m queued)
ce8i5-xvhdp-el23s4fjsrp1mjb: singularity: 0m7s(11m queued)
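To see where that time goes, the conversion can be reproduced by hand on a compute node. This is a hedged sketch; the paths are illustrative, and crunch-run normally drives this step itself:

# The docker-archive:// input matches the invocation crunch-run logs later in
# this ticket; most of the wall-clock time is spent in mksquashfs, which
# singularity runs internally to pack the SIF filesystem.
time singularity build /tmp/image.sif docker-archive:///tmp/image.tar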
Updated by Ward Vandewege over 3 years ago
I upgraded the ce8i5 compute image to use singularity 3.7.4 and switched ce8i5 over to singularity again. The amount of time it takes to convert the docker image to a singularity image seems unchanged (i.e. it's very slow).
The CWL tests were running, and all the jobs started failing because they were trying to write to a read-only filesystem, e.g. ce8i5-xvhdp-6aj1g0l31rae2y4, ce8i5-xvhdp-x95gwpw2gdf3113, etc. See run-cwl-test-ce8i5 #387. I filed #17816 for this problem.
I then switched ce8i5 back to use docker.
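Switching a cluster between runtimes is a config change. A hedged sketch of how to double-check the active setting, assuming a standard arvados-server install and the Containers.RuntimeEngine config key:

# Validate the edited cluster config and confirm which runtime is active.
arvados-server config-check
arvados-server config-dump | grep RuntimeEngine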
Further testing on this ticket is probably pointless until we fix #17816.
Updated by Ward Vandewege over 3 years ago
- Related to Bug #17816: singularity not setting working directory added
Updated by Ward Vandewege over 3 years ago
- Related to deleted (Bug #17816: singularity not setting working directory)
Updated by Ward Vandewege over 3 years ago
- Blocks Bug #17816: singularity not setting working directory added
Updated by Ward Vandewege over 3 years ago
- Blocks deleted (Bug #17816: singularity not setting working directory)
Updated by Ward Vandewege over 3 years ago
- Blocked by Bug #17816: singularity not setting working directory added
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-06-23 sprint to 2021-07-07 sprint
Updated by Tom Clegg over 3 years ago
- Target version changed from 2021-07-07 sprint to 2021-07-21 sprint
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-07-21 sprint to 2021-08-04 sprint
Updated by Ward Vandewege over 3 years ago
Revisiting this with latest main (which has the fix from #17816).
I built a new compute image for ce8i5 with main + my 17755-add-singularity-to-compute-image branch at c2aceca339ec3a6f3d853865cebd0efe348ff518, image id: ce8i5-compute-v1627063559
The diagnostics workflow worked fine, as before: https://workbench.ce8i5.arvadosapi.com/container_requests/ce8i5-xvhdp-6f6xwvqpjhusey0
Re-running one of the cwl tests that failed last time succeeded:
https://workbench.ce8i5.arvadosapi.com/container_requests/ce8i5-xvhdp-wqwktdy4ijpvrgx#Log
I kicked off the cwl tests workflow at run-cwl-test-ce8i5: #420
It failed (errors to be checked), and it took 14 hours (!) as opposed to the normal 3+ hours.
The next batch of singularity fixes (#17813) has been merged, which includes the fix to do the image conversion just once. The cwl tests are running against it at run-cwl-test-ce8i5 #423. That run completed in 2 hours and 31 minutes, about 35 minutes faster than with Docker (!). One test failure remains, because we do not support the Docker ENTRYPOINT feature yet.
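Roughly, the ENTRYPOINT gap looks like this (image names are hypothetical, and the exact crunch-run invocation is an assumption here):

# Docker prepends the image's ENTRYPOINT to the supplied command:
docker run --rm example/tool arg1     # runs: <ENTRYPOINT> arg1
# Executing the converted image with singularity exec runs the command as-is,
# so any setup the ENTRYPOINT was supposed to do never happens:
singularity exec tool.sif arg1        # runs: arg1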
Updated by Peter Amstutz over 3 years ago
- Related to Idea #16305: Singularity support added
Updated by Ward Vandewege over 3 years ago
Preparing to run pirca-xvhdp-m4r8n9qmoh5ggra. The costanalyzer says these node types were used in that run on pirca:
m5.2xlarge
m5.4xlarge
m5.large
m5.xlarge
t3.small
Our quota on ce8i5 (azure) is tiny, so maybe we'll move this testing over to tordo (aws) where we have a bit more room. Building compute image with singularity for tordo at packer-build-compute-image: #138 ...
That image built, I've reconfigured tordo to use Singularity with it. I'm gathering the files needed to run this workflow at https://workbench.tordo.arvadosapi.com/projects/tordo-j7d0g-q56odvtnbo37ajt.
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-08-04 sprint to 2021-08-18 sprint
Updated by Ward Vandewege over 3 years ago
Ward Vandewege wrote:
> Preparing to run pirca-xvhdp-m4r8n9qmoh5ggra. The costanalyzer says these node types were used in that run on pirca:
> m5.2xlarge
> m5.4xlarge
> m5.large
> m5.xlarge
> t3.small
> Our quota on ce8i5 (azure) is tiny, so maybe we'll move this testing over to tordo (aws) where we have a bit more room. Building compute image with singularity for tordo at packer-build-compute-image: #138 ...
> That image built, I've reconfigured tordo to use Singularity with it. I'm gathering the files needed to run this workflow at https://workbench.tordo.arvadosapi.com/projects/tordo-j7d0g-q56odvtnbo37ajt.
Hmm, tordo seems to be having problems running its usual diag jobs because `singularity build` requires mksquashfs, from the squashfs-tools package (cf. diagnostics-tordo #259/console). I've added that dependency to the 17755-add-singularity-to-compute-image branch and am building a new image with it for tordo in packer-build-compute-image #139/console. Curious that I didn't see that on ce8i5; I suppose the azure Debian base image includes that package and the AWS one does not.
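For reference, a quick check of that dependency on a compute node (a hedged sketch; the packer branch remains the authoritative fix):

# mksquashfs is shipped in the squashfs-tools package on Debian and is
# required at runtime by singularity build.
which mksquashfs || sudo apt-get install -y squashfs-tools
singularity --version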
New AMI for tordo is ami-0d9c1d43a4700db01, the config has been updated, and the diagnostics job succeeded at diagnostics-tordo: #260 /console with singularity.
Good. Now, the actual chromosome 19 workflow.
The gist of it is to check out https://github.com/arvados/arvados-tutorial and then run something like `arvados-cwl-runner cwl/wgs-processing-wf.cwl yml/wgs-processing-wf-chr19.yml`.
I copied all the collections referenced in wgs-processing-wf-chr19.yml to my 17755 test project at tordo-j7d0g-q56odvtnbo37ajt.
And then:
arvados-cwl-runner --project-uuid tordo-j7d0g-q56odvtnbo37ajt ./cwl/wgs-processing-wf.cwl ./yml/wgs-processing-wf-chr19.yml
INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20210729201354, arvados-python-client 2.3.0.dev20210726143205, cwltool 3.0.20210319143721
INFO Resolved './cwl/wgs-processing-wf.cwl' to 'file:///home/ward/arvados-tutorial/WGS-processing/cwl/wgs-processing-wf.cwl'
INFO Using cluster tordo (https://workbench2.tordo.arvadosapi.com/)
INFO Upload local files: "head.html" "generatereport.py" "tail.html"
INFO Uploaded to 4fce132bc4ae9b4cc844115246a6bd41+175 (tordo-4zz18-9p4undcuyh75qot)
INFO ['docker', 'pull', 'broadinstitute/gatk:4.1.7.0']
4.1.7.0: Pulling from broadinstitute/gatk
9ff7e2e5f967: Pull complete
59856638ac9f: Pull complete
6f317d6d954b: Pull complete
a9dde5e2a643: Pull complete
f7e0d1ec4184: Pull complete
8ec82ad80803: Pull complete
Digest: sha256:192fedf9b9d65809da4a2954030269e3f311d296e6b5e4c6c7deec12c7fe84b2
Status: Downloaded newer image for broadinstitute/gatk:4.1.7.0
docker.io/broadinstitute/gatk:4.1.7.0
INFO Uploading Docker image broadinstitute/gatk:4.1.7.0
2021-08-06 21:12:40 arvados.arv_put[16368] INFO: Creating new cache file at /home/ward/.cache/arvados/arv-put/6e8880e401836b16215bb738675fcb39
4022M / 4022M 100.0%
2021-08-06 21:13:48 arvados.arv_put[16368] INFO:
2021-08-06 21:13:48 arvados.arv_put[16368] INFO: Collection saved as 'Docker image broadinstitute gatk:4.1.7.0 sha256:e7b5b' tordo-4zz18-bc2qarec0qvov3i
2021-08-06 21:13:49 cwltool[16368] INFO: ['docker', 'pull', 'curii/clinvar-report']
Using default tag: latest
Error response from daemon: pull access denied for curii/clinvar-report, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
2021-08-06 21:13:49 cwltool[16368] ERROR: Workflow error, try again with --debug for more information:
cwl/helper/gvcf-to-vcf.cwl:7:5: Command '['docker', 'pull', 'curii/clinvar-report']' returned non-zero exit status 1. (X-Request-Id: req-lqjx8gknsk1uif6s7g0m)
Hmmm, looks like I missed that one.
We should probably publish that container on dockerhub...
ward@shell:~/arvados-tutorial/WGS-processing$ arv-copy pirca-4zz18-rn8581yvkk73lmc --project-uuid tordo-j7d0g-q56odvtnbo37ajt
pirca-4zz18-rn8581yvkk73lmc: 1067M / 1067M 100.0%
2021-08-06 21:16:19 arvados.arv-copy[16704] INFO:
2021-08-06 21:16:19 arvados.arv-copy[16704] INFO: Success: created copy with uuid tordo-4zz18-4zqcmpzmobikh0w
Take 2:
ward@shell:~/arvados-tutorial/WGS-processing$ arvados-cwl-runner --project-uuid tordo-j7d0g-q56odvtnbo37ajt ./cwl/wgs-processing-wf.cwl ./yml/wgs-processing-wf-chr19.yml
INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20210729201354, arvados-python-client 2.3.0.dev20210726143205, cwltool 3.0.20210319143721
INFO Resolved './cwl/wgs-processing-wf.cwl' to 'file:///home/ward/arvados-tutorial/WGS-processing/cwl/wgs-processing-wf.cwl'
INFO Using cluster tordo (https://workbench2.tordo.arvadosapi.com/)
INFO Upload local files: "head.html" "generatereport.py" "tail.html"
INFO Using collection 4fce132bc4ae9b4cc844115246a6bd41+175 (tordo-4zz18-9p4undcuyh75qot)
INFO ['docker', 'pull', 'curii/bwa-samtools']
Using default tag: latest
Error response from daemon: pull access denied for curii/bwa-samtools, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
ERROR Workflow error, try again with --debug for more information:
cwl/helper/bwamem-samtools-view.cwl:7:5: Command '['docker', 'pull', 'curii/bwa-samtools']' returned non-zero exit status 1.
Jeez. More copying:
ward@shell:~/arvados-tutorial/WGS-processing$ arv-copy pirca-4zz18-mjt3nkmmuuo4uos --project-uuid tordo-j7d0g-q56odvtnbo37ajt
pirca-4zz18-mjt3nkmmuuo4uos: 999M / 999M 100.0%
2021-08-06 21:21:46 arvados.arv-copy[17076] INFO:
2021-08-06 21:21:46 arvados.arv-copy[17076] INFO: Success: created copy with uuid tordo-4zz18-8r0gqxll14jcpfv
And one more try:
ward@shell:~/arvados-tutorial/WGS-processing$ arvados-cwl-runner --project-uuid tordo-j7d0g-q56odvtnbo37ajt ./cwl/wgs-processing-wf.cwl ./yml/wgs-processing-wf-chr19.yml
INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20210729201354, arvados-python-client 2.3.0.dev20210726143205, cwltool 3.0.20210319143721
INFO Resolved './cwl/wgs-processing-wf.cwl' to 'file:///home/ward/arvados-tutorial/WGS-processing/cwl/wgs-processing-wf.cwl'
INFO Using cluster tordo (https://workbench2.tordo.arvadosapi.com/)
INFO Upload local files: "generatereport.py" "tail.html" "head.html"
INFO Using collection 4fce132bc4ae9b4cc844115246a6bd41+175 (tordo-4zz18-9p4undcuyh75qot)
INFO ['docker', 'pull', 'biocontainers/fastqc:v0.11.9_cv6']
v0.11.9_cv6: Pulling from biocontainers/fastqc
9ff7e2e5f967: Already exists
59856638ac9f: Already exists
6f317d6d954b: Already exists
a9dde5e2a643: Already exists
675cac559d07: Pull complete
0e6dd20ee818: Pull complete
374c558e71da: Pull complete
0df3c64bb19a: Pull complete
e936d7784cf9: Pull complete
4dfd8d164cc5: Pull complete
473490461998: Pull complete
8f5e491552e6: Pull complete
a66ab3a674d9: Pull complete
18f922275a6f: Pull complete
ef8214645dd0: Pull complete
Digest: sha256:c740cf547548689724bf43b3e923fea6a7588c17d59b7cbd76fa9ac7544102c5
Status: Downloaded newer image for biocontainers/fastqc:v0.11.9_cv6
docker.io/biocontainers/fastqc:v0.11.9_cv6
INFO Uploading Docker image biocontainers/fastqc:v0.11.9_cv6
2021-08-06 21:24:52 arvados.arv_put[17135] INFO: Creating new cache file at /home/ward/.cache/arvados/arv-put/28720ca8e76a2892c0ea812e2e0e0ab6
1001M / 1001M 100.0%
2021-08-06 21:25:06 arvados.arv_put[17135] INFO:
2021-08-06 21:25:06 arvados.arv_put[17135] INFO: Collection saved as 'Docker image biocontainers fastqc:v0.11.9_cv6 sha256:33755' tordo-4zz18-h6aj03stuegf0hu
2021-08-06 21:25:08 cwltool[17135] INFO: Using collection cache size 256 MiB
2021-08-06 21:25:10 arvados.cwl-runner[17135] INFO: [container WGS processing workflow scattered over samples] submitted container_request tordo-xvhdp-4dc27lh513rh7ai
2021-08-06 21:25:10 arvados.cwl-runner[17135] INFO: Monitor workflow progress at https://workbench2.tordo.arvadosapi.com/processes/tordo-xvhdp-4dc27lh513rh7ai
It's running now. I should have used --no-wait, but I'm in screen so it's all right.
Updated by Ward Vandewege over 3 years ago
That almost worked. One step got stuck in the 'Queued' state:
https://workbench.tordo.arvadosapi.com/container_requests/tordo-xvhdp-iwwdqh2w5wyprix#Log
Looks like the container was restarted twice. The third attempt appears to be stuck.
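A hedged way to confirm the retry count from the CLI (assumes the standard `arv` client and the container_request `container_count` field):

# Inspect the request's retry count, current container, and state.
arv container_request get --uuid=tordo-xvhdp-iwwdqh2w5wyprix \
  | jq '{state, container_count, container_uuid}'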
Nothing obviously wrong on the compute node:
root@ip-10-253-254-170:/home/crunch# singularity --version
singularity version 3.7.4
root@ip-10-253-254-170:/var/log# df -h
Filesystem       Size  Used Avail Use% Mounted on
udev             970M     0  970M   0% /dev
tmpfs            196M   18M  179M   9% /run
/dev/nvme1n1p1   7.7G  3.0G  4.4G  41% /
tmpfs            980M     0  980M   0% /dev/shm
tmpfs            5.0M  4.0K  5.0M   1% /run/lock
tmpfs            980M     0  980M   0% /sys/fs/cgroup
/dev/nvme1n1p15  124M  278K  124M   1% /boot/efi
tmpfs            196M     0  196M   0% /run/user/1000
/dev/mapper/tmp   47G   81M   47G   1% /tmp
tmpfs            196M     0  196M   0% /run/user/1001
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 2 0.0 0.0 0 0 ? S 21:26 0:00 [kthreadd] root 3 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [rcu_gp] root 4 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [rcu_par_gp] root 6 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [kworker/0:0H-kblockd] root 8 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [mm_percpu_wq] root 9 0.0 0.0 0 0 ? S 21:26 0:00 \_ [ksoftirqd/0] root 10 0.0 0.0 0 0 ? I 21:26 0:00 \_ [rcu_sched] root 11 0.0 0.0 0 0 ? I 21:26 0:00 \_ [rcu_bh] root 12 0.0 0.0 0 0 ? S 21:26 0:00 \_ [migration/0] root 14 0.0 0.0 0 0 ? S 21:26 0:00 \_ [cpuhp/0] root 15 0.0 0.0 0 0 ? S 21:26 0:00 \_ [cpuhp/1] root 16 0.0 0.0 0 0 ? S 21:26 0:00 \_ [migration/1] root 17 0.0 0.0 0 0 ? S 21:26 0:00 \_ [ksoftirqd/1] root 19 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [kworker/1:0H-kblockd] root 20 0.0 0.0 0 0 ? S 21:26 0:00 \_ [kdevtmpfs] root 21 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [netns] root 22 0.0 0.0 0 0 ? S 21:26 0:00 \_ [kauditd] root 23 0.0 0.0 0 0 ? S 21:26 0:00 \_ [khungtaskd] root 24 0.0 0.0 0 0 ? S 21:26 0:00 \_ [oom_reaper] root 25 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [writeback] root 26 0.0 0.0 0 0 ? S 21:26 0:00 \_ [kcompactd0] root 27 0.0 0.0 0 0 ? SN 21:26 0:00 \_ [ksmd] root 28 0.0 0.0 0 0 ? SN 21:26 0:00 \_ [khugepaged] root 29 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [crypto] root 30 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [kintegrityd] root 31 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [kblockd] root 32 0.0 0.0 0 0 ? S 21:26 0:00 \_ [watchdogd] root 33 0.0 0.0 0 0 ? I 21:26 0:00 \_ [kworker/1:1-events] root 34 0.0 0.0 0 0 ? I 21:26 0:00 \_ [kworker/u4:1-events_unbound] root 36 0.0 0.0 0 0 ? S 21:26 0:00 \_ [kswapd0] root 52 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [kthrotld] root 53 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [ipv6_addrconf] root 62 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [kstrp] root 67 0.0 0.0 0 0 ? I 21:26 0:00 \_ [kworker/1:2-mm_percpu_wq] root 104 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [nvme-wq] root 105 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [nvme-reset-wq] root 106 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [nvme-delete-wq] root 107 0.0 0.0 0 0 ? I 21:26 0:00 \_ [kworker/u4:2-events_unbound] root 115 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [ena] root 184 0.0 0.0 0 0 ? S 21:26 0:00 \_ [jbd2/nvme1n1p1-] root 185 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [ext4-rsv-conver] root 229 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [kworker/0:1H-kblockd] root 257 0.0 0.0 0 0 ? I 21:26 0:00 \_ [kworker/0:2-events] root 528 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [kworker/1:1H-kblockd] root 807 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [kdmflush] root 1254 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [kdmflush] root 1259 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [kcryptd_io] root 1260 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [kcryptd] root 1261 0.0 0.0 0 0 ? S 21:26 0:00 \_ [dmcrypt_write] root 1263 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [kworker/u5:1-kcryptd] root 1264 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [kworker/u5:2-kcryptd] root 1347 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [xfsalloc] root 1348 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [xfs_mru_cache] root 1349 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [xfs-buf/dm-1] root 1350 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [xfs-data/dm-1] root 1351 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [xfs-conv/dm-1] root 1352 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [xfs-cil/dm-1] root 1353 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [xfs-reclaim/dm-] root 1354 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [xfs-log/dm-1] root 1355 0.0 0.0 0 0 ? I< 21:26 0:00 \_ [xfs-eofblocks/d] root 1356 0.0 0.0 0 0 ? S 21:26 0:00 \_ [xfsaild/dm-1] root 3687 0.0 0.0 0 0 ? I 22:17 0:00 \_ [kworker/0:0-events_power_efficient] root 6027 0.0 0.0 0 0 ? I 23:12 0:00 \_ [kworker/0:1-xfs-cil/dm-1] root 6186 0.0 0.0 0 0 ? 
I< 23:13 0:00 \_ [kworker/u5:0-kcryptd] root 1 0.0 0.5 22116 10224 ? Ss 21:26 0:02 /sbin/init root 247 0.0 0.5 42948 11668 ? Ss 21:26 0:05 /lib/systemd/systemd-journald root 265 0.0 0.2 20068 4880 ? Ss 21:26 0:00 /lib/systemd/systemd-udevd root 381 0.0 0.2 9488 5720 ? Ss 21:26 0:00 /sbin/dhclient -4 -v -i -pf /run/dhclient.ens5.pid -lf /var/lib/dhcp/dhclient.ens5.leases -I -df /var/lib/dhcp/dhclient6.ens5.leases ens5 root 483 0.0 0.2 9488 5644 ? Ss 21:26 0:00 /sbin/dhclient -6 -v -pf /run/dhclient6.ens5.pid -lf /var/lib/dhcp/dhclient6.ens5.leases -I -df /var/lib/dhcp/dhclient.ens5.leases -nw ens5 message+ 579 0.0 0.2 8808 4132 ? Ss 21:26 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only root 581 0.0 0.3 19520 7336 ? Ss 21:26 0:00 /lib/systemd/systemd-logind root 582 0.0 0.2 225824 4280 ? Ssl 21:26 0:01 /usr/sbin/rsyslogd -n -iNONE root 584 0.0 0.1 10228 2664 ? Ss 21:26 0:00 /usr/sbin/cron -f _chrony 590 0.0 0.1 4816 2228 ? S 21:26 0:00 /usr/sbin/chronyd -F -1 _chrony 594 0.0 0.0 4816 268 ? S 21:26 0:00 \_ /usr/sbin/chronyd -F -1 root 591 0.0 0.0 5612 1688 tty1 Ss+ 21:26 0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux root 592 0.0 0.1 7136 2052 ttyS0 Ss+ 21:26 0:00 /sbin/agetty -o -p -- \u --keep-baud 115200,38400,9600 ttyS0 vt220 root 593 0.0 0.3 15852 6708 ? Ss 21:26 0:00 /usr/sbin/sshd -D root 625 0.0 0.3 16612 7828 ? Ss 21:26 0:00 \_ sshd: admin [priv] admin 651 0.0 0.2 16612 4848 ? S 21:26 0:00 | \_ sshd: admin@notty root 1268 0.0 0.3 16612 7956 ? Ss 21:26 0:00 \_ sshd: admin [priv] admin 1274 0.0 0.2 16784 5348 ? S 21:26 0:01 | \_ sshd: admin@notty root 6025 0.0 0.4 16608 8076 ? Ss 23:12 0:00 \_ sshd: crunch [priv] crunch 6043 0.0 0.2 16608 4940 ? S 23:12 0:00 \_ sshd: crunch@pts/0 crunch 6044 0.0 0.2 9648 4736 pts/0 Ss 23:12 0:00 \_ -bash root 6058 0.0 0.1 11664 3700 pts/0 S 23:12 0:00 \_ sudo -s root 6059 0.0 0.1 8716 3632 pts/0 S 23:12 0:00 \_ /bin/bash root 6222 0.0 0.1 12512 3204 pts/0 R+ 23:14 0:00 \_ ps auxwf admin 628 0.0 0.4 21148 9164 ? Ss 21:26 0:00 /lib/systemd/systemd --user admin 629 0.0 0.1 22888 2304 ? S 21:26 0:00 \_ (sd-pam) root 1315 0.2 2.2 1327060 44344 ? Sl 21:26 0:14 /var/lib/arvados/crunch-run~59eb599fd48e090ea6b99794587612df -no-detach --detach --stdin-env --runtime-engine=singularity tordo-dz642-xm3fbl5he5o2aul root 1325 0.1 2.4 1320308 49556 ? Sl 21:26 0:12 \_ /usr/share/python3/dist/python3-arvados-fuse/bin/python /usr/bin/arv-mount --foreground --allow-other --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --mount-by-id by_uuid /tmp/crunch-run.tordo-dz642-xm3fbl5he5o2aul.806337907/keep928487990 root 1417 0.1 4.1 1074032 83872 ? Ssl 21:26 0:09 /usr/sbin/dockerd -H fd:// --default-ulimit nofile=10000:10000 --dns 10.253.0.2 root 1424 0.1 1.6 878344 32680 ? Ssl 21:26 0:12 \_ docker-containerd --config /var/run/docker/containerd/containerd.toml --log-level info root 1546 0.0 1.0 32536 21896 ? Ss 21:27 0:00 /usr/share/python3/dist/arvados-docker-cleaner/bin/python /usr/bin/arvados-docker-cleaner crunch 6029 0.0 0.4 21160 9120 ? Ss 23:12 0:00 /lib/systemd/systemd --user crunch 6030 0.0 0.1 23080 2556 ? S 23:12 0:00 \_ (sd-pam)
From /var/log/auth.log:
Aug 6 21:26:56 ip-10-253-254-170 systemd-logind[581]: New session 3 of user admin. Aug 6 21:26:56 ip-10-253-254-170 sudo: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/usr/bin/cat /var/run/arvados-instance-secret Aug 6 21:26:56 ip-10-253-254-170 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Aug 6 21:26:56 ip-10-253-254-170 sudo: pam_unix(sudo:session): session closed for user root Aug 6 21:26:56 ip-10-253-254-170 sudo: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/usr/bin/docker ps -q Aug 6 21:26:56 ip-10-253-254-170 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Aug 6 21:26:56 ip-10-253-254-170 sudo: pam_unix(sudo:session): session closed for user root Aug 6 21:26:56 ip-10-253-254-170 sudo: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/usr/bin/sh -c set -e; dstdir="/var/lib/arvados/"; dstfile="/var/lib/arvados/crunch-run~59eb599fd48e090ea6b99794587612df"; mkdir -p "$dstdir"; touch "$dstfile"; chmod 0755 "$dstdir" "$dstfile"; cat >"$dstfile" Aug 6 21:26:56 ip-10-253-254-170 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Aug 6 21:26:57 ip-10-253-254-170 sudo: pam_unix(sudo:session): session closed for user root Aug 6 21:26:57 ip-10-253-254-170 sudo: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/var/lib/arvados/crunch-run~59eb599fd48e090ea6b99794587612df --list Aug 6 21:26:57 ip-10-253-254-170 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Aug 6 21:26:57 ip-10-253-254-170 sudo: pam_unix(sudo:session): session closed for user root Aug 6 21:26:57 ip-10-253-254-170 sudo: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/var/lib/arvados/crunch-run~59eb599fd48e090ea6b99794587612df --detach --stdin-env --runtime-engine=singularity tordo-dz642-xm3fbl5he5o2aul Aug 6 21:26:57 ip-10-253-254-170 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Aug 6 21:26:57 ip-10-253-254-170 sudo: pam_unix(sudo:session): session closed for user root Aug 6 21:27:06 ip-10-253-254-170 sudo: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/var/lib/arvados/crunch-run~59eb599fd48e090ea6b99794587612df --list Aug 6 21:27:06 ip-10-253-254-170 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Aug 6 21:27:06 ip-10-253-254-170 sudo: pam_unix(sudo:session): session closed for user root Aug 6 21:27:16 ip-10-253-254-170 sudo: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/var/lib/arvados/crunch-run~59eb599fd48e090ea6b99794587612df --list Aug 6 21:27:16 ip-10-253-254-170 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Aug 6 21:27:16 ip-10-253-254-170 sudo: pam_unix(sudo:session): session closed for user root Aug 6 21:27:26 ip-10-253-254-170 sudo: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/var/lib/arvados/crunch-run~59eb599fd48e090ea6b99794587612df --list Aug 6 21:27:26 ip-10-253-254-170 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Aug 6 21:27:26 ip-10-253-254-170 sudo: pam_unix(sudo:session): session closed for user root Aug 6 21:27:36 ip-10-253-254-170 sudo: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/var/lib/arvados/crunch-run~59eb599fd48e090ea6b99794587612df --list Aug 6 21:27:36 ip-10-253-254-170 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Aug 6 21:27:36 ip-10-253-254-170 sudo: pam_unix(sudo:session): session closed for user root Aug 6 21:27:46 ip-10-253-254-170 sudo: admin : TTY=unknown ; 
PWD=/home/admin ; USER=root ; COMMAND=/var/lib/arvados/crunch-run~59eb599fd48e090ea6b99794587612df --list Aug 6 21:27:46 ip-10-253-254-170 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Aug 6 21:27:46 ip-10-253-254-170 sudo: pam_unix(sudo:session): session closed for user root Aug 6 21:27:56 ip-10-253-254-170 sudo: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/var/lib/arvados/crunch-run~59eb599fd48e090ea6b99794587612df --list Aug 6 21:27:56 ip-10-253-254-170 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Aug 6 21:27:56 ip-10-253-254-170 sudo: pam_unix(sudo:session): session closed for user root .... and on and on and on .... Aug 6 23:25:56 ip-10-253-254-170 sudo: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/var/lib/arvados/crunch-run~59eb599fd48e090ea6b99794587612df --list Aug 6 23:25:56 ip-10-253-254-170 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Aug 6 23:25:56 ip-10-253-254-170 sudo: pam_unix(sudo:session): session closed for user root Aug 6 23:26:06 ip-10-253-254-170 sudo: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/var/lib/arvados/crunch-run~59eb599fd48e090ea6b99794587612df --list Aug 6 23:26:06 ip-10-253-254-170 sudo: pam_unix(sudo:session): session opened for user root by (uid=0) Aug 6 23:26:06 ip-10-253-254-170 sudo: pam_unix(sudo:session): session closed for user root
Running that command manually shows:
root@ip-10-253-254-170:/var/log# /var/lib/arvados/crunch-run~59eb599fd48e090ea6b99794587612df --list
tordo-dz642-xm3fbl5he5o2aul
So we get the uuid of the container back. That's expected.
I guess something happened to the arv-mount process; did it get stuck?
root@ip-10-253-254-170:/var/log# v /proc/1325/fd/
total 0
dr-x------ 2 root root  0 Aug 6 21:26 ./
dr-xr-xr-x 9 root root  0 Aug 6 21:26 ../
lr-x------ 1 root root 64 Aug 6 21:26 0 -> /dev/null
l-wx------ 1 root root 64 Aug 6 21:26 1 -> 'pipe:[16552]'
l-wx------ 1 root root 64 Aug 6 21:26 2 -> 'pipe:[16553]'
lrwx------ 1 root root 64 Aug 6 21:26 3 -> /run/lock/crunch-run-tordo-dz642-xm3fbl5he5o2aul.lock*
lrwx------ 1 root root 64 Aug 6 21:26 4 -> 'socket:[17495]'
lrwx------ 1 root root 64 Aug 6 23:34 5 -> /dev/fuse
lrwx------ 1 root root 64 Aug 6 23:34 6 -> 'socket:[17503]'
root@ip-10-253-254-170:/var/log# v /run/lock/crunch-run-tordo-dz642-xm3fbl5he5o2aul.lock
-rwx------ 1 root root 50 Aug 6 21:26 /run/lock/crunch-run-tordo-dz642-xm3fbl5he5o2aul.lock*
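One additional check that might have been worth running at the time (a hedged sketch; the mount path is taken from the ps output above):

# If arv-mount were healthy, listing the container's keep mount should return
# promptly; a timeout here would point at a hung FUSE mount.
timeout 10 ls /tmp/crunch-run.tordo-dz642-xm3fbl5he5o2aul.806337907/keep928487990 \
  || echo "keep mount appears unresponsive"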
Updated by Ward Vandewege over 3 years ago
Trying to cancel this job: the cancel button in Workbench doesn't work. This is the error I see in the controller logs:
Aug 06 23:41:16 tordo.arvadosapi.com arvados-controller[13414]: {"PID":13414,"RequestID":"req-3x51wipebp66b8pxx8hg","level":"info","msg":"response","remoteAddr":"127.0.0.1:41130","reqBytes":1982,"reqForwardedFor":"10.253.0.41","reqHost":"tordo.arvadosapi.com","reqMethod":"POST","reqPath":"arvados/v1/container_requests/tordo-xvhdp-iwwdqh2w5wyprix","reqQuery":"","respBody":"{\"errors\":[\"//railsapi.internal/arvados/v1/container_requests/tordo-xvhdp-iwwdqh2w5wyprix: 422 Unprocessable Entity: Mounts cannot be modified in state 'Final' ({\\\"/keep/35d4f4c5720a1e1c7b5c583cc901c3e0+3765\\\"=\\u003e{\\\"kind\\\"=\\u003e\\\"collection\\\", \\\"portable_data_hash\\\"=\\u003e\\\"35d4f4c5720a1e1c7b5c583cc901c3e0+3765\\\"}, \\\"/tmp\\\"=\\u003e{\\\"capacity\\\"=\\u003e1073741824, \\\"kind\\\"=\\u003e\\\"tmp\\\"}, \\\"/var/spool/cwl\\\"=\\u003e{\\\"capacity\\\"=\\u003e1073741824, \\\"kind\\\"=\\u003e\\\"tmp\\\"}}, {\\\"/keep/35d4f4c5720a1e1c7b5c583cc901c3e0+3765\\\"=\\u003e{\\\"capacity\\\"=\\u003e0, \\\"commit\\\"=\\u003e\\\"\\\", \\\"content\\\"=\\u003enil, \\\"device_type\\\"=\\u003e\\\"\\\", \\\"exclude_from_output\\\"=\\u003efalse, \\\"git_url\\\"=\\u003e\\\"\\\", \\\"kind\\\"=\\u003e\\\"collection\\\", \\\"path\\\"=\\u003e\\\"\\\", \\\"portable_data_hash\\\"=\\u003e\\\"35d4f4c5720a1e1c7b5c583cc901c3e0+3765\\\", \\\"repository_name\\\"=\\u003e\\\"\\\", \\\"uuid\\\"=\\u003e\\\"\\\", \\\"writable\\\"=\\u003efalse}, \\\"/tmp\\\"=\\u003e{\\\"capacity\\\"=\\u003e1073741824, \\\"commit\\\"=\\u003e\\\"\\\", \\\"content\\\"=\\u003enil, \\\"device_type\\\"=\\u003e\\\"\\\", \\\"exclude_","respBytes":1618,"respStatus":"Unprocessable Entity","respStatusCode":422,"time":"2021-08-06T23:41:16.438539926Z","timeToStatus":0.018596,"timeTotal":0.018626,"timeWriteBody":0.000030}
This sounds vaguely familiar. This is a second bug, then.
I'll kill the node with a-d-c.
curl -H "Authorization: Bearer $TOKEN" -X GET http://10.253.0.44:9006/arvados/v1/dispatch/instances |jq . { "items": [ { "instance": "i-0321dc62106d21d98", "address": "10.253.254.104", "price": 0.0208, "arvados_instance_type": "t3small", "provider_instance_type": "t3.small", "last_container_uuid": "tordo-dz642-mitu9yi55h7femh", "last_busy": "2021-08-07T00:07:16.063745476Z", "worker_state": "running", "idle_behavior": "run" }, { "instance": "i-0362e2a61c62cfaf2", "address": "10.253.254.170", "price": 0.0208, "arvados_instance_type": "t3small", "provider_instance_type": "t3.small", "last_container_uuid": "tordo-dz642-xm3fbl5he5o2aul", "last_busy": "2021-08-07T00:07:16.078001157Z", "worker_state": "running", "idle_behavior": "run" } ] } tordo:~# curl -H "Authorization: Bearer $TOKEN" -X POST http://10.253.0.44:9006/arvados/v1/dispatch/instances/kill?instance_id=i-0362e2a61c62cfaf2 |jq . tordo:~# curl -H "Authorization: Bearer $TOKEN" -X GET http://10.253.0.44:9006/arvados/v1/dispatch/instances |jq . { "items": [ { "instance": "i-0321dc62106d21d98", "address": "10.253.254.104", "price": 0.0208, "arvados_instance_type": "t3small", "provider_instance_type": "t3.small", "last_container_uuid": "tordo-dz642-mitu9yi55h7femh", "last_busy": "2021-08-07T00:07:56.023842384Z", "worker_state": "running", "idle_behavior": "run" }, { "instance": "i-0362e2a61c62cfaf2", "address": "10.253.254.170", "price": 0.0208, "arvados_instance_type": "t3small", "provider_instance_type": "t3.small", "last_container_uuid": "tordo-dz642-xm3fbl5he5o2aul", "last_busy": "2021-08-07T00:07:46.041616399Z", "worker_state": "shutdown", "idle_behavior": "run" } ] }
A new node was created:
tordo:~# curl -H "Authorization: Bearer $TOKEN" -X GET http://10.253.0.44:9006/arvados/v1/dispatch/instances |jq . { "items": [ { "instance": "i-0321dc62106d21d98", "address": "10.253.254.104", "price": 0.0208, "arvados_instance_type": "t3small", "provider_instance_type": "t3.small", "last_container_uuid": "tordo-dz642-mitu9yi55h7femh", "last_busy": "2021-08-07T00:08:56.02153052Z", "worker_state": "running", "idle_behavior": "run" }, { "instance": "i-06dd557531f9df013", "address": "10.253.254.243", "price": 0.0208, "arvados_instance_type": "t3small", "provider_instance_type": "t3.small", "last_container_uuid": "tordo-dz642-xm3fbl5he5o2aul", "last_busy": "2021-08-07T00:08:56.047446083Z", "worker_state": "running", "idle_behavior": "run" } ] }
And this node ran the job without trouble: https://workbench.tordo.arvadosapi.com/container_requests/tordo-xvhdp-iwwdqh2w5wyprix#Status.
I think we have a missing error check somewhere early in the singularity docker loading code. There was a (transient?) problem on this particular node, and that made the container get stuck.
Updated by Peter Amstutz over 3 years ago
The first two failures, I think, are due to a race condition between two crunch-run processes trying to convert and cache the singularity image.
This attempt found the collection, but apparently before the cached image had been created (which is not supposed to happen).
2021-08-06T21:26:16.608212665Z Using Docker image id "sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7"
2021-08-06T21:26:16.608240987Z Loading Docker image from keep
2021-08-06T21:26:17.188261275Z building singularity image
2021-08-06T21:26:17.223285265Z [singularity build /tmp/crunch-run.tordo-dz642-pez87oegh5fgbs7.099258167/keep207354666/by_uuid/tordo-4zz18-t0wx4utpwx4ligf/image.sif docker-archive:///tmp/crunch-run-singularity-917581112/image.tar]
2021-08-06T21:26:26.466105517Z INFO: Starting build...
2021-08-06T21:26:26.466105517Z FATAL: While performing build: conveyor failed to get: Error loading tar component 337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7.json: open /tmp/crunch-run-singularity-917581112/image.tar: no such file or directory
2021-08-06T21:26:26.466234171Z error in Run: While loading container image: exit status 255
2021-08-06T21:26:26.466268708Z error in CaptureOutput: error scanning files to copy to output: lstat "/var/spool/cwl": lstat /tmp/crunch-run.tordo-dz642-pez87oegh5fgbs7.099258167/tmp701045228: no such file or directory
2021-08-06T21:26:26.605452965Z Cancelled
On the second attempt, it tried to create a collection with the same temporary name (down to the exact timestamp?) and that failed.
2021-08-06T21:26:47.149336231Z Executing container 'tordo-dz642-amjt50vnz4qyn4n'
...
2021-08-06T21:26:47.972965997Z error in Run: While loading container image: error creating 'singularity image for sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7 2021-08-06T21:26:47Z' collection: request failed: https://tordo.arvadosapi.com/arvados/v1/collections: 422 Unprocessable Entity: //railsapi.internal/arvados/v1/collections: 422 Unprocessable Entity: #<ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_collections_on_owner_uuid_and_name"
2021-08-06T21:26:47.972965997Z DETAIL: Key (owner_uuid, name)=(tordo-j7d0g-7p82g804nk5l7gx, singularity image for sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7 2021-08-06T21:26:47Z) already exists.
2021-08-06T21:26:47.972965997Z : INSERT INTO "collections" ("owner_uuid", "created_at", "modified_by_user_uuid", "modified_at", "portable_data_hash", "updated_at", "uuid", "manifest_text", "name", "properties", "delete_at", "file_names", "trash_at", "current_version_uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14) RETURNING "id"> (req-8t57dqc95orqsvelydce)
2021-08-06T21:26:48.136224600Z Cancelled
Updated by Peter Amstutz over 3 years ago
The 3rd failure is puzzling: if you check stderr, it actually did run to the end, but failed during the output upload. This happened after the intervention of being stuck in the queued state and killing the compute node.
2021-08-07T00:08:37.343893355Z Executing container 'tordo-dz642-xm3fbl5he5o2aul'
...
2021-08-07T00:08:39.823076366Z Waiting for container to finish
2021-08-07T00:10:47.551657400Z error in CaptureOutput: error scanning files to copy to output: lstat "/var/spool/cwl": lstat /tmp/crunch-run.tordo-dz642-xm3fbl5he5o2aul.609138299/tmp536330368: no such file or directory
2021-08-07T00:10:47.568368336Z Cancelled
Updated by Peter Amstutz over 3 years ago
Was the "queued" container actually "locked"? (probably but would be good to confirm).
Because it didn't get far enough, the container record was reused for the 2nd attempt, so the logs in Keep are only from the second attempt. Fortunately there are logs stored on the API server:
https://workbench.tordo.arvadosapi.com/containers/tordo-dz642-xm3fbl5he5o2aul#Log
Although they don't tell us very much. It hung sometime after starting arv-mount.
This might be related to the event handling deadlock we've been seeing.
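For the record, one way to pull those API-server-stored log events from the CLI (a hedged sketch; assumes the standard `arv` client and that the log text is stored under properties.text):

arv log list \
  --filters '[["object_uuid","=","tordo-dz642-xm3fbl5he5o2aul"]]' \
  --order 'created_at asc' | jq -r '.items[].properties.text'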
Updated by Ward Vandewege over 3 years ago
Peter Amstutz wrote:
Was the "queued" container actually "locked"? (probably but would be good to confirm).
Because it didn't get far enough, the container record was reused for the 2nd attempt, all the logs in keep are only the second attempt. Fortunately there are logs stored on the API server:
https://workbench.tordo.arvadosapi.com/containers/tordo-dz642-xm3fbl5he5o2aul#Log
Although they don't tell us very much. It hung sometime after starting arv-mount.
This might be related to the event handling deadlock we've been seeing.
I'll make a new compute image before doing further testing: ami-0ba36796c859d41a2 has the arv-mount fix built in.
Updated by Ward Vandewege over 3 years ago
- Related to Bug #18027: [crunch-run] upload failed (singularity) added
Updated by Ward Vandewege over 3 years ago
- Related to Bug #18026: [singularity] crunch-run image cache race condition added
Updated by Ward Vandewege over 3 years ago
Peter Amstutz wrote:
Was the "queued" container actually "locked"? (probably but would be good to confirm).
Because it didn't get far enough, the container record was reused for the 2nd attempt, all the logs in keep are only the second attempt. Fortunately there are logs stored on the API server:
https://workbench.tordo.arvadosapi.com/containers/tordo-dz642-xm3fbl5he5o2aul#Log
Although they don't tell us very much. It hung sometime after starting arv-mount.
This might be related to the event handling deadlock we've been seeing.
Now that the fix for the arv-mount deadlock has been merged, I've built a new compute image with that fix (ami-0ba36796c859d41a2) and kicked off a new test run with:
ward@shell:~/arvados-tutorial/WGS-processing$ arvados-cwl-runner --project-uuid tordo-j7d0g-q56odvtnbo37ajt ./cwl/wgs-processing-wf.cwl ./yml/wgs-processing-wf-chr19.yml INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20210813182157, arvados-python-client 2.3.0.dev20210810185721, cwltool 3.0.20210319143721 INFO Resolved './cwl/wgs-processing-wf.cwl' to 'file:///home/ward/arvados-tutorial/WGS-processing/cwl/wgs-processing-wf.cwl' INFO Using cluster tordo (https://workbench2.tordo.arvadosapi.com/) INFO Upload local files: "tail.html" "head.html" "generatereport.py" INFO Using collection 4fce132bc4ae9b4cc844115246a6bd41+175 (tordo-4zz18-9p4undcuyh75qot) INFO Using collection cache size 256 MiB INFO [container WGS processing workflow scattered over samples] submitted container_request tordo-xvhdp-911ekye8xxwcb7r INFO Monitor workflow progress at https://workbench2.tordo.arvadosapi.com/processes/tordo-xvhdp-911ekye8xxwcb7r INFO [container WGS processing workflow scattered over samples] tordo-xvhdp-911ekye8xxwcb7r is Final INFO Overall process status is success INFO Final output collection 431b75fdad437345c678fe58163ad2ec+1243 INFO Output at https://workbench2.tordo.arvadosapi.com/collections/431b75fdad437345c678fe58163ad2ec+1243 { "gvcf": [ { "basename": "ERR2122553.g.vcf.gz", "class": "File", "format": "http://edamontology.org/format_3016", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122553.g.vcf.gz", "secondaryFiles": [ { "basename": "ERR2122553.g.vcf.gz.tbi", "class": "File", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122553.g.vcf.gz.tbi" } ], "size": 100428289 }, { "basename": "ERR2122554.g.vcf.gz", "class": "File", "format": "http://edamontology.org/format_3016", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122554.g.vcf.gz", "secondaryFiles": [ { "basename": "ERR2122554.g.vcf.gz.tbi", "class": "File", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122554.g.vcf.gz.tbi" } ], "size": 83396874 }, { "basename": "ERR2122555.g.vcf.gz", "class": "File", "format": "http://edamontology.org/format_3016", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122555.g.vcf.gz", "secondaryFiles": [ { "basename": "ERR2122555.g.vcf.gz.tbi", } ], "size": 100428289 }, { "basename": "ERR2122554.g.vcf.gz", "class": "File", "format": "http://edamontology.org/format_3016", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122554.g.vcf.gz", "secondaryFiles": [ { "basename": "ERR2122554.g.vcf.gz.tbi", "class": "File", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122554.g.vcf.gz.tbi" } ], "size": 83396874 }, { "basename": "ERR2122555.g.vcf.gz", "class": "File", "format": "http://edamontology.org/format_3016", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122555.g.vcf.gz", "secondaryFiles": [ { "basename": "ERR2122555.g.vcf.gz.tbi", "class": "File", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122555.g.vcf.gz.tbi" } ], "size": 68070153 }, { "basename": "ERR2122556.g.vcf.gz", "class": "File", "format": "http://edamontology.org/format_3016", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122556.g.vcf.gz", "secondaryFiles": [ { "basename": "ERR2122556.g.vcf.gz.tbi", "class": "File", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122556.g.vcf.gz.tbi" } ], "size": 80162449 }, { "basename": "ERR2122557.g.vcf.gz", "class": "File", "format": "http://edamontology.org/format_3016", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122557.g.vcf.gz", "secondaryFiles": [ { 
"basename": "ERR2122557.g.vcf.gz.tbi", "class": "File", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122557.g.vcf.gz.tbi" } ], "size": 92916518 } ], "report": [ { "basename": "ERR2122553.html", "class": "File", "format": "http://edamontology.org/format_2331", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122553.html", "size": 213477 }, { "basename": "ERR2122554.html", "class": "File", "format": "http://edamontology.org/format_2331", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122554.html", "size": 203560 }, { "basename": "ERR2122555.html", "class": "File", "format": "http://edamontology.org/format_2331", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122555.html", "size": 224649 }, { "basename": "ERR2122556.html", "class": "File", "format": "http://edamontology.org/format_2331", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122556.html", "size": 210935 }, { "basename": "ERR2122557.html", "class": "File", "format": "http://edamontology.org/format_2331", "location": "keep:431b75fdad437345c678fe58163ad2ec+1243/ERR2122557.html", "size": 225496 } ] } INFO Final process status is success
Hmm, that reused everything, not quite what I wanted.
Trying again:
ward@shell:~/arvados-tutorial/WGS-processing$ arvados-cwl-runner --disable-reuse --project-uuid tordo-j7d0g-q56odvtnbo37ajt ./cwl/wgs-processing-wf.cwl ./yml/wgs-processing-wf-chr19.yml
INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20210813182157, arvados-python-client 2.3.0.dev20210810185721, cwltool 3.0.20210319143721
INFO Resolved './cwl/wgs-processing-wf.cwl' to 'file:///home/ward/arvados-tutorial/WGS-processing/cwl/wgs-processing-wf.cwl'
INFO Using cluster tordo (https://workbench2.tordo.arvadosapi.com/)
INFO Upload local files: "generatereport.py" "tail.html" "head.html"
INFO Using collection 4fce132bc4ae9b4cc844115246a6bd41+175 (tordo-4zz18-9p4undcuyh75qot)
INFO Using collection cache size 256 MiB
INFO [container WGS processing workflow scattered over samples] submitted container_request tordo-xvhdp-bb8pjmzh93hkzmz
INFO Monitor workflow progress at https://workbench2.tordo.arvadosapi.com/processes/tordo-xvhdp-bb8pjmzh93hkzmz
...
This run completed successfully without obvious issues: https://workbench.tordo.arvadosapi.com/container_requests/tordo-xvhdp-bb8pjmzh93hkzmz.
This container started at 5:09 PM on 8/17/2021 and completed in 1h18m at 6:27 PM on 8/17/2021. It has a runtime of 1h16m (2m queued) and used 5h18m of node allocation time (4.2x scaling).
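As a quick sanity check on that scaling figure:

# node allocation time / container runtime, in minutes
echo "scale=2; (5*60+18)/(60+16)" | bc   # 4.18, i.e. roughly the reported 4.2x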
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-08-18 sprint to 2021-09-01 sprint
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-09-01 sprint to 2021-09-15 sprint
Updated by Ward Vandewege over 3 years ago
17755-add-singularity-to-compute-image is ready for review at 72d7d41944006d1f48f570784dafe56b9812b0c8
It builds singularity from source; it would be nice if there were a better way.
Updated by Ward Vandewege over 3 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|9695271cd15811612cb25f53fb38f133f747cd3a.