Bug #18260

[CWL] integration test 35 failed (singularity/LSF)

Added by Ward Vandewege 13 days ago. Updated about 4 hours ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
10/20/2021
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

On 9tee4 (configured to use LSF + singularity), test 35 from the Arvados CWL test suite fails consistently (e.g. https://ci.arvados.org/view/Arvados%20build%20pipeline/job/run-cwl-test-9tee4/1251/console):

15:02:51 Test 35 failed: arvados-cwl-runner --disable-reuse --compute-checksum --api=containers --outdir=/tmp/tmp36moghbh --quiet 16169-no-listing-hint.cwl listing-job.yml
15:02:51 Test cwltool:LoadListingRequirement propagation
15:02:51 Returned non-zero
15:02:51 ERROR [container 16169-no-listing-hint.cwl] (9tee4-dz642-rw8f8gnf0bbtbf9) error log:
15:02:51 
15:02:51   2021-10-08T19:02:31.198703294Z crunch-run Not starting a gateway server (GatewayAuthSecret was not provided by dispatcher)
15:02:51   2021-10-08T19:02:31.198978468Z crunch-run crunch-run 2.3.0~dev20211008165008 (go1.17.1) started
15:02:51   2021-10-08T19:02:31.199000348Z crunch-run Executing container '9tee4-dz642-rw8f8gnf0bbtbf9' using singularity runtime
15:02:51   2021-10-08T19:02:31.199034119Z crunch-run Executing on host 'compute0.9tee4.arvadosapi.com'
15:02:51   2021-10-08T19:02:31.316882478Z crunch-run container token "v2/9tee4-gj3su-nzdweqs16uwdd1c/17n5aocpl1h49w3fdpt6itws6m3jzt2iacgk3hed3kl676czhj/9tee4-dz642-rw8f8gnf0bbtbf9" 
15:02:51   2021-10-08T19:02:31.317415885Z crunch-run Running [arv-mount --foreground --allow-other --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-tmp tmp0 --mount-by-pdh by_id --mount-by-id by_uuid /tmp/crunch-run.9tee4-dz642-rw8f8gnf0bbtbf9.754694554/keep3163623830]
15:02:51   2021-10-08T19:02:31.956248212Z crunch-run Fetching Docker image from collection 'a2498feb8afadffbde268ab00b07fa4b+261'
15:02:51   2021-10-08T19:02:32.048897698Z crunch-run Using Docker image id "sha256:e520e478c5777b695294faa6cb442edeb63112ae3f66b38ea3520c8e755c03c8" 
15:02:51   2021-10-08T19:02:32.048970278Z crunch-run Loading Docker image from keep
15:02:51   2021-10-08T19:02:32.978977970Z crunch-run Starting container
15:02:51   2021-10-08T19:02:32.980308098Z crunch-run Waiting for container to finish
15:02:51   2021-10-08T19:02:41.930650770Z stderr INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20211006204117, arvados-python-client 2.3.0.dev20211006204117, cwltool 3.1.20210922203925
15:02:51   2021-10-08T19:02:41.949220287Z stderr INFO Resolved '/var/lib/cwl/workflow.json#main' to 'file:///var/lib/cwl/workflow.json#main'
15:02:51   2021-10-08T19:02:46.251818792Z stderr ERROR Tool definition failed initialization:
15:02:51   2021-10-08T19:02:46.251818792Z stderr Tool definition file:///var/lib/cwl/workflow.json#16169-step.cwl failed validation:
15:02:51   2021-10-08T19:02:46.251818792Z stderr ../../lib/cwl/workflow.json:1:12:    checking object `../../lib/cwl/workflow.json#main`
15:02:51   2021-10-08T19:02:46.251818792Z stderr ../../lib/cwl/workflow.json:1:170:     checking field `requirements`
15:02:51   2021-10-08T19:02:46.251818792Z stderr ../../lib/cwl/workflow.json:1:186:       checking item
15:02:51   2021-10-08T19:02:46.251818792Z stderr ../../lib/cwl/workflow.json:1:187:         Field `class` contains undefined reference to
15:02:51   2021-10-08T19:02:46.251818792Z stderr                                            `http://commonwl.org/cwltool#LoadListingRequirement`
15:02:51   2021-10-08T19:02:46.832474311Z crunch-run Complete
15:02:51 ERROR Overall process status is permanentFail
15:02:51 WARNING Final process status is permanentFail

On tordo (a-d-c + singularity) this test passes without problems.


Subtasks

Task #18267: Write down debugging strategies for a-c-rResolvedPeter Amstutz

History

#1 Updated by Ward Vandewege 13 days ago

  • Description updated (diff)

#2 Updated by Ward Vandewege 9 days ago

  • Release set to 42

#3 Updated by Ward Vandewege 9 days ago

  • Subject changed from [CWL] test 35 failed (singularity/LSF) to [CWL] integration test 35 failed (singularity/LSF)

#4 Updated by Ward Vandewege 9 days ago

I re-ran the failing test with --debug:

ci@shell:~/arvados/sdk/cwl/tests$ arvados-cwl-runner --disable-reuse --compute-checksum --api=containers --outdir=/tmp/tmpwn7jhrb1 --debug 16169-no-listing-hint.cwl listing-job.yml
INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20211008201118, arvados-python-client 2.3.0.dev20211006204117, cwltool 3.1.20210922203925
INFO Resolved '16169-no-listing-hint.cwl' to 'file:///home/ci/arvados/sdk/cwl/tests/16169-no-listing-hint.cwl'
INFO Using cluster 9tee4 (https://workbench2.9tee4.arvadosapi.com/)
INFO Upload local files: "tmp1" 
INFO Using collection 1c874bd4fe26056034be8a06e4c3133b+65 (9tee4-4zz18-8ro3shrbetz8wvc)
INFO Using collection cache size 256 MiB
DEBUG ENTER jobiter 1634066162.1472414
DEBUG EXIT jobiter 1634066162.1485233 0.0012819766998291016
DEBUG ENTER run 1634066162.1486175
INFO [container 16169-no-listing-hint.cwl] submitted container_request 9tee4-xvhdp-rfad19xav4j3qau
INFO Monitor workflow progress at https://workbench2.9tee4.arvadosapi.com/processes/9tee4-xvhdp-rfad19xav4j3qau
DEBUG EXIT run 1634066162.5738823 0.425264835357666
DEBUG ENTER jobiter 1634066162.57399
DEBUG EXIT jobiter 1634066162.5740707 8.058547973632812e-05
INFO [container 16169-no-listing-hint.cwl] 9tee4-xvhdp-rfad19xav4j3qau is Final
DEBUG {'9tee4-bi6l4-uwgvqzzw4z97qca': OrderedDict([('href', '/keep_services/9tee4-bi6l4-uwgvqzzw4z97qca'), ('kind', 'arvados#keepService'), ('etag', 'c65xiys117q01qz9s7zvjt66i'), ('uuid', '9tee4-bi6l4-uwgvqzzw4z97qca'), ('owner_uuid', '9tee4-tpzed-ruo0j78b8s7jk57'), ('created_at', '2014-08-27T02:19:29.655913000Z'), ('modified_by_client_uuid', '9tee4-ozdt8-yft7aixdt2dx5ni'), ('modified_by_user_uuid', '9tee4-tpzed-ruo0j78b8s7jk57'), ('modified_at', '2014-08-27T02:20:21.044273000Z'), ('service_host', 'keep0.9tee4.arvadosapi.com'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://keep0.9tee4.arvadosapi.com:25107/')]), '9tee4-bi6l4-1fum8ggyzc88g87': OrderedDict([('href', '/keep_services/9tee4-bi6l4-1fum8ggyzc88g87'), ('kind', 'arvados#keepService'), ('etag', 'dn6g7rkwx5gpwq0h053wyquax'), ('uuid', '9tee4-bi6l4-1fum8ggyzc88g87'), ('owner_uuid', '9tee4-tpzed-ruo0j78b8s7jk57'), ('created_at', '2014-08-27T02:20:27.578073000Z'), ('modified_by_client_uuid', '9tee4-ozdt8-yft7aixdt2dx5ni'), ('modified_by_user_uuid', '9tee4-tpzed-ruo0j78b8s7jk57'), ('modified_at', '2014-08-27T02:20:43.978890000Z'), ('service_host', 'keep1.9tee4.arvadosapi.com'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://keep1.9tee4.arvadosapi.com:25107/')])}
DEBUG 196c3dbc8725c9b49c1286d182922388+37950+A350e222fe194db5094cac41517e9ec170aa230da@6178540a: ['http://keep0.9tee4.arvadosapi.com:25107/', 'http://keep1.9tee4.arvadosapi.com:25107/']
DEBUG Request: GET http://keep0.9tee4.arvadosapi.com:25107/196c3dbc8725c9b49c1286d182922388+37950+A350e222fe194db5094cac41517e9ec170aa230da@6178540a
INFO GET 200: 37950 bytes in 93.83320808410645 msec (0.386 MiB/sec)
ERROR [container 16169-no-listing-hint.cwl] (9tee4-dz642-h60jcvdr3ru19xe) error log:

  2021-10-12T19:16:20.481902684Z stderr   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 234, in __init__
  2021-10-12T19:16:20.481902684Z stderr     self.embedded_tool = load_tool(toolpath_object["run"], loadingContext)
  2021-10-12T19:16:20.481902684Z stderr   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/load_tool.py", line 481, in load_tool
  2021-10-12T19:16:20.481902684Z stderr     loadingContext, workflowobj, uri
  2021-10-12T19:16:20.481902684Z stderr   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/load_tool.py", line 381, in resolve_and_validate_document
  2021-10-12T19:16:20.481902684Z stderr     document_loader.resolve_all(workflowobj, fileuri)
  2021-10-12T19:16:20.481902684Z stderr   File "schema_salad/ref_resolver.py", line 958, in resolve_all
  2021-10-12T19:16:20.481902684Z stderr   File "schema_salad/ref_resolver.py", line 1190, in validate_links
  2021-10-12T19:16:20.481902684Z stderr schema_salad.exceptions.ValidationException: ../../lib/cwl/workflow.json:1:12:  checking object `../../lib/cwl/workflow.json#main`
  2021-10-12T19:16:20.481902684Z stderr ../../lib/cwl/workflow.json:1:170:   checking field `requirements`
  2021-10-12T19:16:20.481902684Z stderr ../../lib/cwl/workflow.json:1:186:     checking item
  2021-10-12T19:16:20.481902684Z stderr ../../lib/cwl/workflow.json:1:187:       Field `class` contains undefined reference to
  2021-10-12T19:16:20.481902684Z stderr                                          `http://commonwl.org/cwltool#LoadListingRequirement`
  2021-10-12T19:16:20.481902684Z stderr
  2021-10-12T19:16:20.481902684Z stderr The above exception was the direct cause of the following exception:
  2021-10-12T19:16:20.481902684Z stderr
  2021-10-12T19:16:20.481902684Z stderr Traceback (most recent call last):
  2021-10-12T19:16:20.481902684Z stderr   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/main.py", line 1067, in main
  2021-10-12T19:16:20.481902684Z stderr     tool = make_tool(uri, loadingContext)
  2021-10-12T19:16:20.481902684Z stderr   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/load_tool.py", line 462, in make_tool
  2021-10-12T19:16:20.481902684Z stderr     tool = loadingContext.construct_tool_object(processobj, loadingContext)
  2021-10-12T19:16:20.481902684Z stderr   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/arvados_cwl/executor.py", line 210, in arv_make_tool
  2021-10-12T19:16:20.481902684Z stderr     return ArvadosWorkflow(self, toolpath_object, loadingContext)
  2021-10-12T19:16:20.481902684Z stderr   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/arvados_cwl/arvworkflow.py", line 169, in __init__
  2021-10-12T19:16:20.481902684Z stderr     super(ArvadosWorkflow, self).__init__(toolpath_object, loadingContext)
  2021-10-12T19:16:20.481902684Z stderr   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 106, in __init__
  2021-10-12T19:16:20.481902684Z stderr     step, index, loadingContext, loadingContext.prov_obj
  2021-10-12T19:16:20.481902684Z stderr   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/arvados_cwl/arvworkflow.py", line 352, in make_workflow_step
  2021-10-12T19:16:20.481902684Z stderr     return ArvadosWorkflowStep(toolpath_object, pos, loadingContext, self.arvrunner, *argc, **argv)
  2021-10-12T19:16:20.481902684Z stderr   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/arvados_cwl/arvworkflow.py", line 145, in __init__
  2021-10-12T19:16:20.481902684Z stderr     super(ArvadosWorkflowStep, self).__init__(toolpath_object, pos, loadingContext, *argc, **argv)
  2021-10-12T19:16:20.481902684Z stderr   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 241, in __init__
  2021-10-12T19:16:20.481902684Z stderr     ) from vexc
  2021-10-12T19:16:20.481902684Z stderr cwltool.errors.WorkflowException: Tool definition file:///var/lib/cwl/workflow.json#16169-step.cwl failed validation:
  2021-10-12T19:16:20.481902684Z stderr ../../lib/cwl/workflow.json:1:12:    checking object `../../lib/cwl/workflow.json#main`
  2021-10-12T19:16:20.481902684Z stderr ../../lib/cwl/workflow.json:1:170:     checking field `requirements`
  2021-10-12T19:16:20.481902684Z stderr ../../lib/cwl/workflow.json:1:186:       checking item
  2021-10-12T19:16:20.481902684Z stderr ../../lib/cwl/workflow.json:1:187:         Field `class` contains undefined reference to
  2021-10-12T19:16:20.481902684Z stderr                                            `http://commonwl.org/cwltool#LoadListingRequirement`
  2021-10-12T19:16:20.903520059Z crunch-run Complete
ERROR Overall process status is permanentFail
INFO Final output collection 5551f5d7ec57a8aa9758d14a976e87e9+57
INFO Output at https://workbench2.9tee4.arvadosapi.com/collections/5551f5d7ec57a8aa9758d14a976e87e9+57
{}
WARNING Final process status is permanentFail

#5 Updated by Peter Amstutz 8 days ago

  • Target version changed from 2021-10-13 sprint to 2021-10-27 sprint

#6 Updated by Peter Amstutz 8 days ago

  • Assigned To set to Ward Vandewege

#7 Updated by Ward Vandewege 8 days ago

Curiously, this works fine when run with --local, which is going to make this a bit harder to debug:

(venv) ci@shell:~/arvados/sdk/cwl/tests$ arvados-cwl-runner --disable-reuse --compute-checksum --api=containers --outdir=/tmp/tmpwn7jhrb1 --debug --submit-runner-image arvados/jobs:2.3.0.dev20211006204117 --local 16169-no-listing-hint.cwl listing-job.yml 
INFO /home/ci/arvados/sdk/cwl/tests/venv/bin/arvados-cwl-runner 2.3.0.dev20211013202728, arvados-python-client 2.2.2, cwltool 3.1.20210922203925
INFO Resolved '16169-no-listing-hint.cwl' to 'file:///home/ci/arvados/sdk/cwl/tests/16169-no-listing-hint.cwl'
INFO Using cluster 9tee4 (https://workbench2.9tee4.arvadosapi.com/)
INFO Upload local files: "tmp1" 
INFO Using collection 1c874bd4fe26056034be8a06e4c3133b+65 (9tee4-4zz18-8ro3shrbetz8wvc)
INFO Using collection cache size 256 MiB
DEBUG ENTER jobiter 1634158594.4664106
DEBUG [workflow 16169-no-listing-hint.cwl] initialized from file:///home/ci/arvados/sdk/cwl/tests/16169-no-listing-hint.cwl
DEBUG EXIT jobiter 1634158594.4677873 0.0013766288757324219
DEBUG ENTER run 1634158594.4678693
INFO [workflow 16169-no-listing-hint.cwl] start
DEBUG EXIT run 1634158594.4680102 0.00014090538024902344
DEBUG ENTER jobiter 1634158594.4680834
DEBUG [workflow 16169-no-listing-hint.cwl] inputs {
    "d": {
        "class": "Directory",
        "location": "keep:1c874bd4fe26056034be8a06e4c3133b+65/tmp1",
        "basename": "tmp1" 
    }
}
INFO [workflow 16169-no-listing-hint.cwl] starting step step1
DEBUG [step step1] job input {
    "file:///home/ci/arvados/sdk/cwl/tests/16169-no-listing-hint.cwl#step1/d": {
        "class": "Directory",
        "location": "keep:1c874bd4fe26056034be8a06e4c3133b+65/tmp1",
        "basename": "tmp1" 
    }
}
DEBUG [step step1] evaluated job input to {
    "file:///home/ci/arvados/sdk/cwl/tests/16169-no-listing-hint.cwl#step1/d": {
        "class": "Directory",
        "location": "keep:1c874bd4fe26056034be8a06e4c3133b+65/tmp1",
        "basename": "tmp1" 
    }
}
INFO [step step1] start
DEBUG [job step1] initializing from file:///home/ci/arvados/sdk/cwl/tests/wf/16169-step.cwl as part of step step1
DEBUG [job step1] {
    "d": {
        "class": "Directory",
        "location": "keep:1c874bd4fe26056034be8a06e4c3133b+65/tmp1",
        "basename": "tmp1" 
    }
}
DEBUG [job step1] path mappings is {
    "keep:1c874bd4fe26056034be8a06e4c3133b+65/tmp1": [
        "keep:1c874bd4fe26056034be8a06e4c3133b+65/tmp1",
        "/keep/1c874bd4fe26056034be8a06e4c3133b+65/tmp1",
        "Directory",
        true
    ]
}
DEBUG [job step1] command line bindings is [
    {
        "position": [
            0,
            0
        ],
        "datum": "echo" 
    },
    {
        "position": [
            0,
            1
        ],
        "valueFrom": "${if(inputs.d.listing === undefined) {return 'true';} else {return 'false';}}" 
    }
]
DEBUG EXIT jobiter 1634158596.1322594 1.6641759872436523
DEBUG ENTER run 1634158596.1329393
DEBUG ENTER generatefiles step1 1634158596.133355
DEBUG EXIT generatefiles step1 1634158596.1334329 7.796287536621094e-05
INFO [container step1] 9tee4-xvhdp-w6eyxafpak1vhm6 state is Committed
DEBUG EXIT run 1634158596.2429066 0.10996723175048828
DEBUG ENTER jobiter 1634158596.2429976
DEBUG EXIT jobiter 1634158596.243226 0.00022840499877929688
DEBUG ENTER jobiter 1634158599.2435837
DEBUG EXIT jobiter 1634158599.2438827 0.0002989768981933594
DEBUG ENTER jobiter 1634158602.2441397
DEBUG EXIT jobiter 1634158602.2444339 0.0002942085266113281
DEBUG ENTER jobiter 1634158605.2447248
DEBUG EXIT jobiter 1634158605.245025 0.0003001689910888672
INFO [container step1] 9tee4-xvhdp-w6eyxafpak1vhm6 is Final
DEBUG Creating collection reader for ae755cd1b3cff63152ff4200f4dea7e9+52
DEBUG {'9tee4-bi6l4-uwgvqzzw4z97qca': OrderedDict([('href', '/keep_services/9tee4-bi6l4-uwgvqzzw4z97qca'), ('kind', 'arvados#keepService'), ('etag', 'c65xiys117q01qz9s7zvjt66i'), ('uuid', '9tee4-bi6l4-uwgvqzzw4z97qca'), ('owner_uuid', '9tee4-tpzed-ruo0j78b8s7jk57'), ('created_at', '2014-08-27T02:19:29.655913000Z'), ('modified_by_client_uuid', '9tee4-ozdt8-yft7aixdt2dx5ni'), ('modified_by_user_uuid', '9tee4-tpzed-ruo0j78b8s7jk57'), ('modified_at', '2014-08-27T02:20:21.044273000Z'), ('service_host', 'keep0.9tee4.arvadosapi.com'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://keep0.9tee4.arvadosapi.com:25107/')]), '9tee4-bi6l4-1fum8ggyzc88g87': OrderedDict([('href', '/keep_services/9tee4-bi6l4-1fum8ggyzc88g87'), ('kind', 'arvados#keepService'), ('etag', 'dn6g7rkwx5gpwq0h053wyquax'), ('uuid', '9tee4-bi6l4-1fum8ggyzc88g87'), ('owner_uuid', '9tee4-tpzed-ruo0j78b8s7jk57'), ('created_at', '2014-08-27T02:20:27.578073000Z'), ('modified_by_client_uuid', '9tee4-ozdt8-yft7aixdt2dx5ni'), ('modified_by_user_uuid', '9tee4-tpzed-ruo0j78b8s7jk57'), ('modified_at', '2014-08-27T02:20:43.978890000Z'), ('service_host', 'keep1.9tee4.arvadosapi.com'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://keep1.9tee4.arvadosapi.com:25107/')])}
DEBUG 74d9a83219cabaab06a69fd318873f33+5+A05437ebe0dadd0b1a34018fde4bd5110bd6d1072@6179bd0e: ['http://keep1.9tee4.arvadosapi.com:25107/', 'http://keep0.9tee4.arvadosapi.com:25107/']
DEBUG Request: GET http://keep1.9tee4.arvadosapi.com:25107/74d9a83219cabaab06a69fd318873f33+5+A05437ebe0dadd0b1a34018fde4bd5110bd6d1072@6179bd0e
INFO GET 200: 5 bytes in 3.016233444213867 msec (0.002 MiB/sec)
DEBUG [step step1] produced output {
    "file:///home/ci/arvados/sdk/cwl/tests/16169-no-listing-hint.cwl#step1/out": {
        "location": "keep:ae755cd1b3cff63152ff4200f4dea7e9+52/output.txt",
        "basename": "output.txt",
        "nameroot": "output",
        "nameext": ".txt",
        "class": "File",
        "checksum": "sha1$724ba28f4a9a1b472057ff99511ed393a45552e1",
        "size": 5
    }
}
INFO [step step1] completed success
INFO [workflow 16169-no-listing-hint.cwl] completed success
DEBUG [workflow 16169-no-listing-hint.cwl] outputs {
    "out": {
        "location": "keep:ae755cd1b3cff63152ff4200f4dea7e9+52/output.txt",
        "basename": "output.txt",
        "nameroot": "output",
        "nameext": ".txt",
        "class": "File",
        "checksum": "sha1$724ba28f4a9a1b472057ff99511ed393a45552e1",
        "size": 5
    }
}
INFO Overall process status is success
DEBUG ENTER jobiter 1634158606.5850677
DEBUG EXIT jobiter 1634158606.5854146 0.00034689903259277344
DEBUG f906f5732e2d841ebd85c2e898b7fd85+202: ['http://keep1.9tee4.arvadosapi.com:25107/', 'http://keep0.9tee4.arvadosapi.com:25107/']
DEBUG Pool max threads is 2
DEBUG Request: PUT http://keep1.9tee4.arvadosapi.com:25107/f906f5732e2d841ebd85c2e898b7fd85
DEBUG Request: PUT http://keep0.9tee4.arvadosapi.com:25107/f906f5732e2d841ebd85c2e898b7fd85
INFO PUT 200: 202 bytes in 21.41737937927246 msec (0.009 MiB/sec)
DEBUG KeepWriterThread <KeepWriterThread(Thread-6, started daemon 139671234782976)> succeeded f906f5732e2d841ebd85c2e898b7fd85+202 http://keep0.9tee4.arvadosapi.com:25107/
INFO PUT 200: 202 bytes in 27.50682830810547 msec (0.007 MiB/sec)
DEBUG KeepWriterThread <KeepWriterThread(Thread-5, started daemon 139671246845696)> succeeded f906f5732e2d841ebd85c2e898b7fd85+202 http://keep1.9tee4.arvadosapi.com:25107/
INFO Final output collection b82af5190efb732119bb6014967d961b+113 "Output of 16169-no-listing-hint.cwl (2021-10-13T20:56:46.637Z)" (9tee4-4zz18-uaywkwyo6anj8r9)
{
    "out": {
        "location": "keep:b82af5190efb732119bb6014967d961b+113/output.txt",
        "basename": "output.txt",
        "class": "File",
        "checksum": "sha1$724ba28f4a9a1b472057ff99511ed393a45552e1",
        "size": 5
    }
}
INFO Final process status is success

#8 Updated by Ward Vandewege 7 days ago

The problem appears to be that the generated /var/lib/workflow.json specifies cwlVersion: 1.0, but LoadListingRequirement was only added in CWL 1.1 (cf https://www.commonwl.org/v1.1/CommandLineTool.html).

Why is this happening, though? And why only on 9tee4?

This is workflow.json, as seen from the perspective of cwltool/load_tool.py in the resolve_and_validate_document function:

{
  "$graph": [
    {
      "class": "Workflow",
      "hints": [
        {
          "class": "DockerRequirement",
          "dockerPull": "arvados/jobs:2.2.2",
          "http://arvados.org/cwl#dockerCollectionPDH": "cb1f184dc5600118c2d2cb873b6771f6+261" 
        }
      ],
      "id": "#main",
      "inputs": [
        {
          "id": "#main/d",
          "type": "Directory" 
        }
      ],
      "outputs": [
        {
          "id": "#main/out",
          "outputSource": "#main/step1/out",
          "type": "File" 
        }
      ],
      "requirements": [
        {
          "class": "http://commonwl.org/cwltool#LoadListingRequirement",
          "loadListing": "no_listing" 
        }
      ],
      "steps": [
        {
          "id": "#main/step1",
          "in": [
            {
              "id": "#main/step1/d",
              "source": "#main/d" 
            }
          ],
          "out": [
            "#main/step1/out" 
          ],
          "run": "#16169-step.cwl" 
        }
      ]
    },
    {
      "arguments": [
        "echo",
        "${if(inputs.d.listing === undefined) {return 'true';} else {return 'false';}}" 
      ],
      "class": "CommandLineTool",
      "id": "#16169-step.cwl",
      "inputs": [
        {
          "id": "#16169-step.cwl/d",
          "type": "Directory" 
        }
      ],
      "outputs": [
        {
          "id": "#16169-step.cwl/out",
          "outputBinding": {
            "glob": "output.txt" 
          },
          "type": "File" 
        }
      ],
      "requirements": [
        {
          "class": "DockerRequirement",
          "dockerPull": "debian:buster-slim",
          "http://arvados.org/cwl#dockerCollectionPDH": "d2903f206785156adcec982c5091d44a+175" 
        },
        {
          "class": "InlineJavascriptRequirement" 
        }
      ],
      "stdout": "output.txt" 
    }
  ],
  "$namespaces": {
    "arv": "http://arvados.org/cwl#",
    "cwltool": "http://commonwl.org/cwltool#" 
  },
  "cwlVersion": "v1.0" 
}

#9 Updated by Peter Amstutz 7 days ago

cwltool:LoadListingRequirement expands to http://commonwl.org/cwltool#LoadListingRequirement

It's an process requirement extension that was added to be able to use the feature in v1.0. It was added to the spec in v1.1. The purpose of this test is to check that the extension version of feature is still available for use in v1.0 workflows.

Arvados-cwl-runner implements this by loading a custom schema that includes the extension. That is also supposed to enable validation.

The validation is checking that the class field refers to a known entity (e.g. cwltool:LoadListingRequirement)

The specific function that does the checking is schema_salad/ref_resolver.py:1036 validate_link()

It checks a few things, the vocabulary (CWL keywords), the "reverse vocabulary" (certain keywords can be expanded to URIs but are supposed to mean the same thing) and the document index (identifiers that are declared in the document itself).

I'm actually not entirely certain which one of those tables the string "http://commonwl.org/cwltool#LoadListingRequirement" would normally appear in, so that could be a clue.

#10 Updated by Peter Amstutz 7 days ago

Ok, here's what I think is going on here.

  1. At some point, I removed cwltool:LoadListingRequirement from the Arvados v1.2 extensions. This was on the assumption that it was redundant because the internal updater, upon loading a v1.0 file, will internally update cwltool:LoadListingRequirement extension to the standard LoadListingRequirement.
  2. For some reason, it is validating the document using the v1.2 schema, but cwltool:LoadListingRequirement appears instead of LoadListingRequirement
  3. The link checking is simpleminded and doesn't really know if something is just an identifier or a link to an actual resource (e.g. you can have "run: http://example.com/tool.cwl") so it goes and tries to make an HTTP HEAD request to determine if something exists
  4. https://www.commonwl.org/cwltool#LoadListingRequirement returns a 404 page. schema salad calls raise_for_status() which is expected to raise an error if it gets 404 (but apparently doesn't?)
  5. Since the link check doesn't fail, it accepts https://www.commonwl.org/cwltool#LoadListingRequirement as a valid identifier and validation passes
  6. Meanwhile, suppose that 9tee4 allows connections to the API server but disallows outgoing connections to the public internet from its compute nodes. This would cause it to raise a connection error, which reports an invalid identifier, which results in the validation error. This would explain why the behavior is inconsistent between 9tee4 and other clusters.

Questions:

  • Why isn't cwltool:LoadListingRequirement converted to the LoadListingRequirement from the spec?
  • Why doesn't raise_for_status() raise an exception?

#11 Updated by Ward Vandewege 6 days ago

  • Assigned To changed from Ward Vandewege to Peter Amstutz
  • Status changed from New to In Progress

#12 Updated by Ward Vandewege 6 days ago

commonwl.org is now on the dns resolution whitelist for 9tee4's compute nodes (to deal with a "networkaccess enabled" test, #224 in the cwl 1.2 conformance test suite). This will affect testing for this bug, beware!

#13 Updated by Peter Amstutz 3 days ago

Answers to previous questions:

  • cwltool:LoadListingRequirement appears because it is loading un-updated code. The code is not updated because a different fragment of the same file was updated, but when it does that, it marks the entire file as being updated, without having updated the other fragments in the file
  • raise_for_status does not raise an error because it is not following redirects, and there is a blanket http->https redirect, as a result it gets a 3xx redirect which would have led to a 404, but since it stopped at the 3xx it didn't raise an error.

The 1st issue is a cwltool bug. I think I have a solution but I need to write a test that demonstrates the bug to confirm the fix.

The 2nd issue is a schema salad bug. I've made a PR with a fix (just adds allow_redirects) but that also needs a test.

#14 Updated by Peter Amstutz 2 days ago

A little more detail.

  1. loading #main fragment
  2. resolve_and_validate_document gets the root document and puts it into "workflowobj"
  3. it loads the 1.0 schema which does contain cwltool:LoadListingRequirement
  4. it calls resolve_all() on workflowobj, this passes
  5. then it calls resolve_ref() to get the fragment we're interested in
  6. then it checks to see if the fragment needs to be updated
  7. it updates the fragment and replaces the fragment in the "index" with the updated one
  8. it updates document "metadata" with the new version ** this is where things start to go wrong
    1. it is noting in the document was updated in metadata, but it didn't update the whole document, only the fragment
    2. the "metadata" dict is actually the index entry of the old un-updated document

later:

  1. loading a different fragment
  2. resolve_and_validate_document gets the root document and puts it into "workflowobj"
  3. the document is marked as being v1.2 from when the #main fragment was updated
  4. however what's actually in the index wasn't updated, it's v1.0 by accident
  5. it loads the 1.2 schema which doesn't contain cwltool:LoadListingRequirement
  6. it calls resolve_all() on workflowobj, this fails

#15 Updated by Peter Amstutz about 8 hours ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF