Bug #17398

[crunch-dispatch-local] [crunch-run] error starting gateway server: missing port in address

Added by Javier Bértoli 3 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 02/18/2021
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description

While trying to update and test the new configuration changes for Arvados with the salt-installer, I ran the deploy using the latest dev packages (2.2.0~dev20210215190825-1).

With them, I cannot successfully run a single-node deploy. The "tools/salt-install/tests/run-test.sh" script fails to run the workflow with the following error:

cwl-runner --debug hasher-workflow.cwl hasher-workflow-job.yml
INFO /usr/bin/cwl-runner 2.2.0.dev20210205202546, arvados-python-client 2.2.0.dev20210205202546, cwltool 3.0.20201121085451
INFO Resolved 'hasher-workflow.cwl' to 'file:///root/cluster_tests/hasher-workflow.cwl'
INFO hasher-workflow.cwl:36:7: Unknown hint WorkReuse
INFO hasher-workflow.cwl:50:7: Unknown hint WorkReuse
INFO hasher-workflow.cwl:64:7: Unknown hint WorkReuse
INFO Using cluster harpo (https://workbench2.harpo.local:8443/)
INFO Upload local files: "test.txt" 
DEBUG {'harpo-bi6l4-a31be630d4e27ba0': OrderedDict([('href', '/keep_services/harpo-bi6l4-a31be630d4e27ba0'), ('kind', 'arvados#keepService'), ('etag', '5hqqxk0vrj54wia5k0ucch6o4'), ('uuid', 'harpo-bi6l4-a31be630d4e27ba0'), ('owner_uuid', 'harpo-tpzed-000000000000000'), ('created_at', '2021-02-17T13:27:59.426575000Z'), ('modified_by_client_uuid', None), ('modified_by_user_uuid', 'harpo-tpzed-000000000000000'), ('modified_at', '2021-02-17T13:27:59.426575000Z'), ('service_host', 'keep.harpo.local'), ('service_port', 8443), ('service_ssl_flag', True), ('service_type', 'proxy'), ('read_only', False), ('_service_root', 'https://keep.harpo.local:8443/')])}
DEBUG 7f2cee57647f15dd443e35537b202981+104: ['https://keep.harpo.local:8443/']
DEBUG Pool max threads is 1
DEBUG Request: PUT https://keep.harpo.local:8443/7f2cee57647f15dd443e35537b202981
INFO PUT 200: 104 bytes in 53.35521697998047 msec (0.002 MiB/sec)
DEBUG KeepWriterThread <KeepWriterThread(Thread-1, started daemon 140587978630912)> succeeded 7f2cee57647f15dd443e35537b202981+104 https://keep.harpo.local:8443/
INFO Using collection f55e750025853f5b8ccae3ca79240f65+54 (harpo-4zz18-l6kmq8rt8ccu8um)
INFO Using collection cache size 256 MiB
DEBUG ENTER jobiter 1613568972.0955696
DEBUG EXIT jobiter 1613568972.096748 0.0011785030364990234
DEBUG ENTER run 1613568972.0968366
DEBUG EXIT run 1613568972.096923 8.654594421386719e-05
DEBUG ENTER jobiter 1613568972.0969877
DEBUG EXIT jobiter 1613568972.0970328 4.506111145019531e-05
INFO [container hasher-workflow.cwl] submitted container_request harpo-xvhdp-39twskql6ok3kw3
INFO Monitor workflow progress at https://workbench2.harpo.local:8443/processes/harpo-xvhdp-39twskql6ok3kw3
INFO [container hasher-workflow.cwl] harpo-xvhdp-39twskql6ok3kw3 is Final
ERROR [container hasher-workflow.cwl] (harpo-dz642-bb0isvfn2x3h6an) error log:
  ** log is empty **
ERROR Overall process status is permanentFail
INFO Final output collection None
INFO Output at https://workbench2.harpo.local:8443/collections/None
{}
WARNING Final process status is permanentFail

Checking Arvados' component logs, I find this error in crunch-dispatch-local:

Feb 17 13:36:14 harpo crunch-dispatch-local[1812]: {"level":"info","msg":"finalized container harpo-dz642-9zdr6wxubfj2k56","time":"2021-02-17T13:36:14.446343132Z"}
Feb 17 13:36:14 harpo crunch-dispatch-local[1812]: 2021/02/17 13:36:14 error starting gateway server: missing port in address
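
For reference, the text "missing port in address" is the wording Go's standard library uses when net.SplitHostPort is given an address without a ":port" part. The sketch below shows what that function returns for an empty string, a bare hostname, and a host:port pair. Whether the gateway code in crunch-run calls net.SplitHostPort, and with which value, is an assumption and is not confirmed by this ticket; splitErr is a hypothetical helper used only for illustration.

```go
package main

import (
	"fmt"
	"net"
)

// splitErr is a hypothetical helper (not part of Arvados). It returns the
// error text produced by net.SplitHostPort, or "" when the address parses.
func splitErr(addr string) string {
	_, _, err := net.SplitHostPort(addr)
	if err != nil {
		return err.Error()
	}
	return ""
}

func main() {
	// Example inputs: an empty address, a hostname with no port,
	// and a well-formed host:port pair.
	for _, addr := range []string{"", "keep.harpo.local", "keep.harpo.local:8443"} {
		fmt.Printf("%q -> %q\n", addr, splitErr(addr))
	}
}
```

With an empty address the error text carries no "address ...:" prefix, which is the same shape as the message logged above. That is consistent with (but does not prove) the listen address being empty when crunch-run is started by crunch-dispatch-local.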

The same setup works OK with the stable version (2.1.1).

Extra information that might help:

If I understand correctly, it runs OK in a multi-host deploy in the cloud (using arvados-dispatch-cloud and the compiled crunch-run binary).

This seems to be related to https://dev.arvados.org/issues/17170


Subtasks

Task #17405: Review 17398-no-ctr-gateway (Resolved, Tom Clegg)


Related issues

Related to Arvados - Feature #17170: Shell into container proof of concept (Resolved, 01/14/2021)

Blocks Arvados - Story #17246: Single node salt install improvements (Resolved, 03/31/2021)

Associated revisions

Revision 752ab2f0
Added by Tom Clegg 3 months ago

Merge branch '17398-no-ctr-gateway'

refs #17398

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Nico César 3 months ago

  • Related to Feature #17170: Shell into container proof of concept added

#2 Updated by Javier Bértoli 3 months ago

  • Blocks Story #17246: Single node salt install improvements added

#3 Updated by Javier Bértoli 3 months ago

  • Blocks Support #17320: Explain what additonal configuration is needed for provision.sh to go to production added

#4 Updated by Javier Bértoli 3 months ago

  • Blocks deleted (Support #17320: Explain what additonal configuration is needed for provision.sh to go to production)

#5 Updated by Tom Clegg 3 months ago

  • Target version changed from To Be Groomed to 2021-03-03 sprint
  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress
  • Subject changed from [crunch-dispatch-local] error running a workflow to [crunch-dispatch-local] [crunch-run] error starting gateway server: missing port in address

#7 Updated by Lucas Di Pentima 3 months ago

The fix is super trivial, LGTM. I wonder why we didn't catch this with a test. Aren't we running tests with crunch-dispatch-local (or even crunch-dispatch-slurm)? If integration tests are too cumbersome to write, maybe we could make the gateway start failure a non-critical error?

#8 Updated by Tom Clegg 3 months ago

A crunch-run integration test pretty much requires a fully functioning cluster. I'd really like to have a loopback driver for a-d-c so we can get rid of crunch-dispatch-local, and (assuming docker is available) run a container in lib/controller integration tests with federation features and everything.

Other than this bug, I don't see a reason why we wouldn't be able to start the gateway service, so I'm not keen to turn it into a "best effort" thing.

Merging & leaving open, both points seem to deserve some more thought.

#9 Updated by Peter Amstutz 2 months ago

  • Target version changed from 2021-03-03 sprint to 2021-03-17 sprint

#10 Updated by Peter Amstutz 2 months ago

  • Target version changed from 2021-03-17 sprint to 2021-03-03 sprint
  • Status changed from In Progress to Resolved
