Feature #19166

closed

Container shell support for SLURM and LSF dispatchers

Added by Peter Amstutz 9 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 06/24/2022
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

Unlike the arvados-dispatch-cloud case, the dispatcher doesn't know which HPC compute node will run the container, and the HPC compute node isn't necessarily even reachable from controller. To work around this, we will make an initial connection in the opposite direction and set up a tunnel.

  • crunch-run connects to new controller API arvados/v1/containers/{uuid}/gateway_tunnel, authenticated using the container key (GatewayAuthSecret)
  • controller registers its own internalURL as the container’s GatewayAddress, and routes incoming container_ssh connections to crunch-run through the tunnel
  • there can be multiple controller hosts/processes; the container_ssh API on controller A will sometimes need to proxy through the same API on controller B
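
The connection flow above can be sketched as a small in-process model. This is only an illustration of the reverse-tunnel idea, not the actual implementation: the `Controller` class, its method names, the `crunch_run` helper, the placeholder UUID, and the use of `socket.socketpair()` in place of a real outbound HTTP connection are all assumptions made for the example, and the secret comparison is a stand-in for whatever authentication the real API performs.

```python
import hmac
import socket
import threading


class Controller:
    """Toy stand-in for the controller side of the tunnel (hypothetical API)."""

    def __init__(self, secrets):
        self.secrets = secrets   # container uuid -> GatewayAuthSecret
        self.tunnels = {}        # container uuid -> connection opened by crunch-run
        self.lock = threading.Lock()

    def gateway_tunnel(self, uuid, secret, conn):
        """Called when crunch-run dials in; keep its connection for later use."""
        expected = self.secrets.get(uuid, "")
        if not expected or not hmac.compare_digest(expected, secret):
            raise PermissionError("bad container key")
        with self.lock:
            self.tunnels[uuid] = conn

    def container_ssh(self, uuid, payload):
        """Route a client request over the tunnel that crunch-run opened."""
        with self.lock:
            conn = self.tunnels.get(uuid)
        if conn is None:
            raise LookupError("no tunnel for " + uuid)
        conn.sendall(payload)
        return conn.recv(4096)


def crunch_run(controller, uuid, secret):
    """Toy crunch-run: connect outward, then answer one request on the tunnel."""
    ours, theirs = socket.socketpair()  # stands in for the outbound connection
    controller.gateway_tunnel(uuid, secret, theirs)

    def serve():
        data = ours.recv(4096)
        ours.sendall(b"echo:" + data)
        ours.close()

    worker = threading.Thread(target=serve, daemon=True)
    worker.start()
    return worker


def demo():
    uuid = "zzzzz-dz642-000000000000000"  # placeholder UUID, not a real container
    controller = Controller({uuid: "s3cret"})
    crunch_run(controller, uuid, "s3cret")
    return controller.container_ssh(uuid, b"ok")
```

The point of the sketch is the direction of the first connection: the controller never dials the compute node, it only reuses a connection the compute node already opened. The controller-A-to-controller-B proxying case is not modeled here.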

Subtasks 1 (0 open, 1 closed)

Task #19184: Review 19166-gateway-tunnel (Resolved, Peter Amstutz, 06/24/2022)

Related issues

Related to Arvados Epics - Story #17207: External access to web services running in containers (New, 09/01/2023, 03/31/2024)

Actions #1

Updated by Peter Amstutz 9 months ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 8 months ago

  • Target version changed from 2022-07-20 to 2022-06-22 Sprint
Actions #4

Updated by Peter Amstutz 8 months ago

  • Assigned To set to Tom Clegg
Actions #5

Updated by Tom Clegg 8 months ago

  • Related to Story #17207: External access to web services running in containers added
Actions #6

Updated by Tom Clegg 8 months ago

  • Status changed from New to In Progress
  • Description updated (diff)
Actions #8

Updated by Peter Amstutz 8 months ago

  • Target version changed from 2022-06-22 Sprint to 2022-07-06
Actions #9

Updated by Tom Clegg 8 months ago

19166-gateway-tunnel @ 3fae0f0626c5152a5aa6f39f0874f0190f2131db -- developer-run-tests: #3196

Includes a doc page about HPC with a description of how the multiplex-tunnel setup works, and an update to the InternalURLs info in the install docs to reflect that it relies on controller-to-controller connections.

Actions #10

Updated by Tom Clegg 7 months ago

As discussed in chat, TODO: crunch-run should not set up a tunnel if it won't actually be used by controller (i.e., if crunch-run won't be saving the tunnel endpoint in the container record because $GatewayAddress is set).

Actions #11

Updated by Tom Clegg 7 months ago

19166-gateway-tunnel @ 87f3da84318306184165dae50f75ac6721d89285 -- developer-run-tests: #3211
  • don't set up tunnel if it won't be used
  • add required glue to slurm and lsf dispatchers (pass GatewayAuthSecret env var)
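
A rough sketch of the decision these two changes describe, assuming the settings reach crunch-run as environment variables named `GatewayAuthSecret` and `GatewayAddress` as mentioned in this thread. The function names, the returned labels, and the "disabled" fallback when no secret is present are hypothetical and only illustrate the logic; they are not taken from the branch.

```python
def job_environment(gateway_auth_secret):
    """Environment a dispatcher might hand to crunch-run via the batch scheduler."""
    return {"GatewayAuthSecret": gateway_auth_secret}


def gateway_plan(env):
    """Decide whether a crunch-run-like process should open a tunnel."""
    if env.get("GatewayAddress"):
        # An address was supplied up front, so the tunnel endpoint would never
        # be saved in the container record; opening one would be wasted work.
        return "listen"
    if env.get("GatewayAuthSecret"):
        # No address known in advance (the SLURM/LSF case): connect outward.
        return "tunnel"
    # Assumption for this sketch: without a secret there is nothing to do.
    return "disabled"
```

For example, `gateway_plan(job_environment("s3cret"))` picks the tunnel path, while adding a `GatewayAddress` entry to the same environment switches it to listening directly.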
Actions #12

Updated by Peter Amstutz 7 months ago

  • Target version changed from 2022-07-06 to 2022-07-20
Actions #13

Updated by Tom Clegg 7 months ago

19166-gateway-tunnel @ dc70bbf9ea15395476107a3b8dff96f754a40998 -- developer-run-tests: #3216
  • add arvados-server dispatch-slurm subcommand (missed in #18947)
  • add crunch-run -version
  • improve some log/debug messages
  • fix plumbing so "shell {uuid} echo ok" exits after running, instead of hanging
  • tested on 9tee4 using slurm+singularity (works, although it's a bit disconcerting that you land in root@compute0:~# because singularity doesn't set up an imaginary hostname inside the container like docker does)
  • tested on 9tee4 using lsf+singularity (doesn't work on 9tee4 because firewall rules prohibit outgoing connections from non-root users to 127.0.0.1, and unlike Slurm, LSF on 9tee4 is configured to run crunch-run as the "crunch" user; but the error message shows that the LSF part per se is working)

todo: add an API handler to "GET .../ssh" so an old arvados-client returns a helpful "upgrade your client" error instead of a mysterious "405 method not allowed".

Actions #14

Updated by Tom Clegg 7 months ago

19166-gateway-tunnel @ 2261d1fd9e1b69d0a60f1f7fe9029317aeb4cf52 -- developer-run-tests: #3219

Example result using old arvados-client:

$ arvados-client shell 9tee4-xvhdp-49i6665mzesonf3
connecting to container 9tee4-dz642-zluu70frgwkb5ke
error setting up tunnel: server did not provide a tunnel: API endpoint is obsolete -- please upgrade your arvados-client program (HTTP 410)
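
For illustration, the same pattern (answer the retired GET route with 410 and a readable message rather than letting it fall through to 405) can be sketched with Python's standard library. This is not the controller code: the `/example/ssh` path, the handler class, and the `probe_old_client` helper are made up for the example, and only the message text is copied from the output above.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

OBSOLETE_MSG = "API endpoint is obsolete -- please upgrade your arvados-client program"


class ObsoleteEndpointHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.endswith("/ssh"):
            # Retired route: 410 Gone plus an explanation the client can print.
            body = OBSOLETE_MSG.encode()
            self.send_response(410)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
        else:
            body = b""
            self.send_response(404)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def probe_old_client():
    """Start a throwaway server on a free port and issue the obsolete GET."""
    server = HTTPServer(("127.0.0.1", 0), ObsoleteEndpointHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
        conn.request("GET", "/example/ssh")  # hypothetical path
        resp = conn.getresponse()
        result = (resp.status, resp.read().decode())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()
```

Calling `probe_old_client()` returns the status code and the message body, which is what an old client would surface to the user.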
Actions #15

Updated by Tom Clegg 7 months ago

  • Target version changed from 2022-07-20 to 2022-08-03 Sprint
Actions #16

Updated by Peter Amstutz 7 months ago

Let's go ahead and merge this, otherwise it's going to sit forever. LGTM.

Actions #17

Updated by Tom Clegg 7 months ago

(re-testing after merging main)

19166-gateway-tunnel @ 2e03d03bc55b5a612c2bf04d878a72f2ee420d99 -- developer-run-tests: #3246

Actions #18

Updated by Tom Clegg 6 months ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|c9b8b9b9c78a77dd30b828914c8bee9fa8dcbb90.

Actions #19

Updated by Peter Amstutz about 2 months ago

  • Release set to 47