Bug #18311

[cwl] test 221 in the 1.2 conformance suite is failing on singularity

Added by Ward Vandewege 7 months ago. Updated 16 days ago.

Status:
New
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

On our singularity clusters (9tee4 and tordo, currently), the failure is (from run-tests-cwl-suite: #37 /console):

21:57:52 Test 221 failed: /usr/bin/arvados-cwl-runner --compute-checksum --disable-reuse --eval-timeout 60 --outdir=/tmp/tmpxpfu_s_d --quiet tests/timelimit2-wf.cwl tests/empty.json
21:57:52 Test that workflow level time limit is not applied to workflow execution time
21:57:52 Returned non-zero
21:57:52 ERROR [container timelimit2-wf.cwl] (9tee4-dz642-j9qil6yio4ms1ku) error log:
21:57:52 
21:57:52   2021-10-29T01:56:31.148851065Z crunch-run Loading Docker image from keep
21:57:52   2021-10-29T01:56:31.641354946Z crunch-run Starting container
21:57:52   2021-10-29T01:56:31.642351277Z crunch-run Waiting for container to finish
21:57:52   2021-10-29T01:56:37.397845753Z stderr INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20211020182823, arvados-python-client 2.3.0.dev20211013202728, cwltool 3.1.20211020155521
21:57:52   2021-10-29T01:56:37.414865353Z stderr INFO Resolved '/var/lib/cwl/workflow.json#main' to 'file:///var/lib/cwl/workflow.json#main'
21:57:52   2021-10-29T01:56:45.046750708Z stderr INFO Using cluster 9tee4 (https://workbench2.9tee4.arvadosapi.com/)
21:57:52   2021-10-29T01:56:45.558323744Z stderr INFO Upload local files: "workflow.json" 
21:57:52   2021-10-29T01:56:45.758393815Z stderr INFO Uploaded to f9d6a1cda3285e35b4b70234c4712b95+61 (9tee4-4zz18-wn85zp8etf12ora)
21:57:52   2021-10-29T01:56:45.763297415Z stderr INFO Upload local files: "workflow.json" 
21:57:52   2021-10-29T01:56:45.887338055Z stderr INFO Uploaded to f9d6a1cda3285e35b4b70234c4712b95+61 (9tee4-4zz18-35bm0g8ul56y4no)
21:57:52   2021-10-29T01:56:50.325226086Z stderr INFO Using collection cache size 256 MiB
21:57:52   2021-10-29T01:56:50.574797297Z stderr INFO Running inside container 9tee4-dz642-j9qil6yio4ms1ku
21:57:52   2021-10-29T01:56:50.596940754Z stderr INFO [workflow workflow.json#main] start
21:57:52   2021-10-29T01:56:50.597164015Z stderr INFO [workflow workflow.json#main] starting step step1
21:57:52   2021-10-29T01:56:50.597407019Z stderr INFO [step step1] start
21:57:52   2021-10-29T01:56:50.759170131Z stderr INFO [container step1] 9tee4-xvhdp-njas6p46cvav01s state is Committed
21:57:52   2021-10-29T01:57:38.633382081Z stderr INFO [container step1] 9tee4-xvhdp-njas6p46cvav01s is Final
21:57:52   2021-10-29T01:57:38.720112167Z stderr ERROR [container step1] (9tee4-xvhdp-njas6p46cvav01s) error log:
21:57:52   2021-10-29T01:57:38.720112167Z stderr
21:57:52   2021-10-29T01:57:38.720112167Z stderr   2021-10-29T01:57:20.484569438Z crunch-run Not starting a gateway server (GatewayAuthSecret was not provided by dispatcher)
21:57:52   2021-10-29T01:57:38.720112167Z stderr   2021-10-29T01:57:20.484755292Z crunch-run crunch-run 2.3.0 (go1.17.1) started
21:57:52   2021-10-29T01:57:38.720112167Z stderr   2021-10-29T01:57:20.484776349Z crunch-run Executing container '9tee4-dz642-szbmvhrdakmb0ha' using singularity runtime
21:57:52   2021-10-29T01:57:38.720112167Z stderr   2021-10-29T01:57:20.484839635Z crunch-run Executing on host 'compute1.9tee4.arvadosapi.com'
21:57:52   2021-10-29T01:57:38.720112167Z stderr   2021-10-29T01:57:20.604062194Z crunch-run container token "v2/9tee4-gj3su-eaaf5ru6ohnw7d2/36sq5h3y5f0qo1s7qlgzrn9hfn4h851ufstfkpjku5sh76i92j/9tee4-dz642-szbmvhrdakmb0ha" 
21:57:52   2021-10-29T01:57:38.720112167Z stderr   2021-10-29T01:57:20.604359220Z crunch-run Running [arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --mount-by-id by_uuid /tmp/crunch-run.9tee4-dz642-szbmvhrdakmb0ha.3491659512/keep4170174168]
21:57:52   2021-10-29T01:57:38.720112167Z stderr   2021-10-29T01:57:21.510019953Z crunch-run Fetching Docker image from collection 'd2a6f06e1f8e3e7d72b8ee89622e9f96+261'
21:57:52   2021-10-29T01:57:38.720112167Z stderr   2021-10-29T01:57:21.576950500Z crunch-run Using Docker image id "sha256:61064933b465210ec06517e95e64b0909841d4ccf037552266d7079baed45a6e" 
21:57:52   2021-10-29T01:57:38.720112167Z stderr   2021-10-29T01:57:21.576984357Z crunch-run Loading Docker image from keep
21:57:52   2021-10-29T01:57:38.720112167Z stderr   2021-10-29T01:57:22.013919659Z crunch-run Starting container
21:57:52   2021-10-29T01:57:38.720112167Z stderr   2021-10-29T01:57:22.014675984Z crunch-run Waiting for container to finish
21:57:52   2021-10-29T01:57:38.720112167Z stderr   2021-10-29T01:57:27.015064319Z crunch-run maximum run time exceeded. Stopping container.
21:57:52   2021-10-29T01:57:38.720112167Z stderr   2021-10-29T01:57:27.015148637Z crunch-run stopping container
21:57:52   2021-10-29T01:57:38.720112167Z stderr   2021-10-29T01:57:27.123407016Z crunch-run Cancelled
21:57:52   2021-10-29T01:57:38.817582951Z stderr ERROR [step step1] Output is missing expected field file:///var/lib/cwl/workflow.json#main/step1/o
21:57:52   2021-10-29T01:57:38.900999492Z stderr WARNING [step step1] completed permanentFail
21:57:52   2021-10-29T01:57:38.976870679Z stderr INFO [workflow workflow.json#main] completed permanentFail
21:57:52   2021-10-29T01:57:38.976917303Z stderr ERROR Overall process status is permanentFail
21:57:52   2021-10-29T01:57:39.142930716Z stderr INFO Final output collection 42866e7c2f47b2fd38dca68903e59c4b+59 "Output of main (2021-10-29T01:57:39.107Z)" (9tee4-4zz18-yr16og3bk6qwk3q)
21:57:52   2021-10-29T01:57:39.374321052Z stderr WARNING Final process status is permanentFail
21:57:52   2021-10-29T01:57:39.956477426Z crunch-run Complete
21:57:52 ERROR Overall process status is permanentFail
21:57:52 WARNING Final process status is permanentFail

And on 9tee4 (from run-tests-cwl-suite: #40 /console):

11:24:02 Test 221 failed: /usr/bin/arvados-cwl-runner --compute-checksum --disable-reuse --eval-timeout 60 --outdir=/tmp/tmp_lnu_8e_ --quiet tests/timelimit2-wf.cwl tests/empty.json
11:24:02 Test that workflow level time limit is not applied to workflow execution time
11:24:02 Returned non-zero
11:24:02 ERROR [container timelimit2-wf.cwl] (9tee4-dz642-tggdcuakup1jfq5) error log:
11:24:02 
11:24:02   2021-10-29T15:22:53.175747990Z crunch-run Loading Docker image from keep
11:24:02   2021-10-29T15:22:53.849775581Z crunch-run Starting container
11:24:02   2021-10-29T15:22:53.850902104Z crunch-run Waiting for container to finish
11:24:02   2021-10-29T15:23:05.619135487Z stderr INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20211020182823, arvados-python-client 2.3.0.dev20211013202728, cwltool 3.1.20211020155521
11:24:02   2021-10-29T15:23:05.636148789Z stderr INFO Resolved '/var/lib/cwl/workflow.json#main' to 'file:///var/lib/cwl/workflow.json#main'
11:24:02   2021-10-29T15:23:14.022887185Z stderr INFO Using cluster 9tee4 (https://workbench2.9tee4.arvadosapi.com/)
11:24:02   2021-10-29T15:23:14.279142453Z stderr INFO Upload local files: "workflow.json" 
11:24:02   2021-10-29T15:23:14.428674656Z stderr INFO Uploaded to f9d6a1cda3285e35b4b70234c4712b95+61 (9tee4-4zz18-9zab9j0xsgvigar)
11:24:02   2021-10-29T15:23:14.435330631Z stderr INFO Upload local files: "workflow.json" 
11:24:02   2021-10-29T15:23:14.571627543Z stderr INFO Uploaded to f9d6a1cda3285e35b4b70234c4712b95+61 (9tee4-4zz18-gjzewquov369heg)
11:24:02   2021-10-29T15:23:19.004648013Z stderr INFO Using collection cache size 256 MiB
11:24:02   2021-10-29T15:23:19.030485917Z stderr INFO Running inside container 9tee4-dz642-tggdcuakup1jfq5
11:24:02   2021-10-29T15:23:19.052813153Z stderr INFO [workflow workflow.json#main] start
11:24:02   2021-10-29T15:23:19.053081383Z stderr INFO [workflow workflow.json#main] starting step step1
11:24:02   2021-10-29T15:23:19.053338000Z stderr INFO [step step1] start
11:24:02   2021-10-29T15:23:19.230789355Z stderr INFO [container step1] 9tee4-xvhdp-qojf0j4ca8lyvnf state is Committed
11:24:02   2021-10-29T15:23:55.085180501Z stderr INFO [container step1] 9tee4-xvhdp-qojf0j4ca8lyvnf is Final
11:24:02   2021-10-29T15:23:55.176969339Z stderr ERROR [container step1] (9tee4-xvhdp-qojf0j4ca8lyvnf) error log:
11:24:02   2021-10-29T15:23:55.176969339Z stderr
11:24:02   2021-10-29T15:23:55.176969339Z stderr   2021-10-29T15:23:45.345522001Z crunch-run Not starting a gateway server (GatewayAuthSecret was not provided by dispatcher)
11:24:02   2021-10-29T15:23:55.176969339Z stderr   2021-10-29T15:23:45.345759158Z crunch-run crunch-run 2.3.0 (go1.17.1) started
11:24:02   2021-10-29T15:23:55.176969339Z stderr   2021-10-29T15:23:45.345779799Z crunch-run Executing container '9tee4-dz642-e1vpz4lmmpzjdr4' using singularity runtime
11:24:02   2021-10-29T15:23:55.176969339Z stderr   2021-10-29T15:23:45.345801566Z crunch-run Executing on host 'compute1.9tee4.arvadosapi.com'
11:24:02   2021-10-29T15:23:55.176969339Z stderr   2021-10-29T15:23:45.459713326Z crunch-run container token "v2/9tee4-gj3su-ewn10p08oeckpxj/1eqyw1uioulmxl78olk0i743ipt4nro0neksh32ee48ks8rjux/9tee4-dz642-e1vpz4lmmpzjdr4" 
11:24:02   2021-10-29T15:23:55.176969339Z stderr   2021-10-29T15:23:45.460034932Z crunch-run Running [arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --mount-by-id by_uuid /tmp/crunch-run.9tee4-dz642-e1vpz4lmmpzjdr4.734327895/keep1570108756]
11:24:02   2021-10-29T15:23:55.176969339Z stderr   2021-10-29T15:23:46.365999816Z crunch-run Fetching Docker image from collection 'd2a6f06e1f8e3e7d72b8ee89622e9f96+261'
11:24:02   2021-10-29T15:23:55.176969339Z stderr   2021-10-29T15:23:46.430319132Z crunch-run Using Docker image id "sha256:61064933b465210ec06517e95e64b0909841d4ccf037552266d7079baed45a6e" 
11:24:02   2021-10-29T15:23:55.176969339Z stderr   2021-10-29T15:23:46.430387977Z crunch-run Loading Docker image from keep
11:24:02   2021-10-29T15:23:55.176969339Z stderr   2021-10-29T15:23:47.011343535Z crunch-run Starting container
11:24:02   2021-10-29T15:23:55.176969339Z stderr   2021-10-29T15:23:47.012153992Z crunch-run Waiting for container to finish
11:24:02   2021-10-29T15:23:55.176969339Z stderr   2021-10-29T15:23:52.012532501Z crunch-run maximum run time exceeded. Stopping container.
11:24:02   2021-10-29T15:23:55.176969339Z stderr   2021-10-29T15:23:52.012641954Z crunch-run stopping container
11:24:02   2021-10-29T15:23:55.176969339Z stderr   2021-10-29T15:23:52.123251411Z crunch-run Cancelled
11:24:02   2021-10-29T15:23:55.259792247Z stderr ERROR [step step1] Output is missing expected field file:///var/lib/cwl/workflow.json#main/step1/o
11:24:02   2021-10-29T15:23:55.344468592Z stderr WARNING [step step1] completed permanentFail
11:24:02   2021-10-29T15:23:55.423478553Z stderr INFO [workflow workflow.json#main] completed permanentFail
11:24:02   2021-10-29T15:23:55.423519944Z stderr ERROR Overall process status is permanentFail
11:24:02   2021-10-29T15:23:55.572398801Z stderr INFO Final output collection 42866e7c2f47b2fd38dca68903e59c4b+59 "Output of main (2021-10-29T15:23:55.541Z)" (9tee4-4zz18-smx5y5kzo0oyh11)
11:24:02   2021-10-29T15:23:55.804266214Z stderr WARNING Final process status is permanentFail
11:24:02   2021-10-29T15:23:56.277148741Z crunch-run Complete
11:24:02 ERROR Overall process status is permanentFail
11:24:02 WARNING Final process status is permanentFail

Compare with ce8i5 where it passes (run-tests-cwl-suite: #38 ).


Subtasks

Task #18615: ReviewNewLucas Di Pentima

History

#1 Updated by Ward Vandewege 7 months ago

  • Description updated (diff)

#2 Updated by Peter Amstutz 6 months ago

  • Target version changed from 2021-11-10 sprint to 2021-11-24 sprint

#3 Updated by Peter Amstutz 6 months ago

  • Target version changed from 2021-11-24 sprint to 2021-12-08 sprint

#4 Updated by Peter Amstutz 6 months ago

  • Target version changed from 2021-12-08 sprint to 2022-01-05 sprint

#5 Updated by Peter Amstutz 5 months ago

  • Target version changed from 2022-01-05 sprint to 2022-01-19 sprint

#6 Updated by Peter Amstutz 4 months ago

  • Target version changed from 2022-01-19 sprint to 2022-02-02 sprint

#7 Updated by Peter Amstutz 4 months ago

  • Target version changed from 2022-02-02 sprint to 2022-02-16 sprint

#8 Updated by Peter Amstutz 3 months ago

  • Target version changed from 2022-02-16 sprint to 2022-03-02 sprint

#9 Updated by Peter Amstutz 3 months ago

  • Target version changed from 2022-03-02 sprint to 2022-03-16 sprint

#10 Updated by Peter Amstutz 3 months ago

  • Target version changed from 2022-03-16 sprint to 2022-03-30 Sprint

#11 Updated by Peter Amstutz 2 months ago

  • Release set to 46

#12 Updated by Peter Amstutz 2 months ago

  • Target version changed from 2022-03-30 Sprint to 2022-04-13 Sprint

#13 Updated by Peter Amstutz about 2 months ago

  • Release changed from 46 to 51

#14 Updated by Peter Amstutz about 2 months ago

  • Target version changed from 2022-04-13 Sprint to 2022-04-27 Sprint

#15 Updated by Peter Amstutz about 2 months ago

  • Target version changed from 2022-04-27 Sprint to 2022-05-11 sprint

#16 Updated by Peter Amstutz about 2 months ago

I looked at this a little closer, I think the problem is that the test has a very short timeout (3 seconds) and that it is taking longer than that to start the singularity container, compared to Docker. (This is because the Docker image is downloaded and started before the container is started, whereas singularity runs from a compressed image on the keep mount, so it has more startup latency). I changed it to skip this test.

#17 Updated by Peter Amstutz 26 days ago

  • Target version changed from 2022-05-11 sprint to 2022-05-25 sprint

#18 Updated by Peter Amstutz 16 days ago

  • Target version changed from 2022-05-25 sprint to 2022-06-08 sprint

#19 Updated by Peter Amstutz 16 days ago

  • Target version changed from 2022-06-08 sprint to 2022-06-22 Sprint

#20 Updated by Peter Amstutz 16 days ago

Todo: increase the timeouts on this test in CWL 1.2.1

#21 Updated by Peter Amstutz 16 days ago

  • Target version changed from 2022-06-22 Sprint to 2022-07-06

Also available in: Atom PDF