Feature #17756

closed

Initial implementation of LSF dispatcher

Added by Peter Amstutz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 07/14/2021
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

Similar to crunch-dispatch-slurm. Should be packaged as a subcommand of arvados-server, though.

Use Arvados SDK "Dispatcher" class to interact with API server.

Start with basic dispatch to queue (bsub) and monitoring.

Stub out commands to support unit testing (as done with slurm).

Test on 9tee4
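
One way to make the bsub call stubbable for unit tests is to pass the command runner in as a function value. The following is only a minimal sketch under that assumption: the names `runner`, `execRunner`, `submit`, and `stubRunner` are hypothetical and are not taken from crunch-dispatch-slurm or the branch for this ticket.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runner abstracts "run this program with these args and this stdin,
// return its stdout". The name and signature are hypothetical.
type runner func(prog string, args []string, stdin string) (string, error)

// execRunner is the production implementation: it runs the real command.
func execRunner(prog string, args []string, stdin string) (string, error) {
	cmd := exec.Command(prog, args...)
	cmd.Stdin = strings.NewReader(stdin)
	out, err := cmd.Output()
	return string(out), err
}

// submit sends a job script to bsub through whichever runner it is given.
func submit(run runner, jobName, script string) (string, error) {
	return run("bsub", []string{"-J", jobName}, script)
}

// stubRunner records every call and returns canned output, so tests
// never need a real LSF installation.
type stubRunner struct {
	calls [][]string
	out   string
}

func (s *stubRunner) run(prog string, args []string, stdin string) (string, error) {
	s.calls = append(s.calls, append([]string{prog}, args...))
	return s.out, nil
}

func main() {
	stub := &stubRunner{out: "Job <190> is submitted to default queue <normal>.\n"}
	out, err := submit(stub.run, "9tee4-dz642-unjdidp7zb08qdm", "#!/bin/sh\nexec crunch-run\n")
	fmt.Printf("%q %v %v\n", out, err, stub.calls)
}
```

In production code `submit(execRunner, ...)` would run the real bsub, while tests pass `stub.run` and then inspect `stub.calls`.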


Subtasks 1 (0 open, 1 closed)

Task #17795: Review 17756-dispatch-lsf (Resolved, Tom Clegg, 07/14/2021)

Related issues

Related to Arvados Epics - Story #16304: LSF support (Resolved, 04/01/2021, 09/30/2021)

Actions #1

Updated by Peter Amstutz over 1 year ago

  • Category set to Crunch
Actions #2

Updated by Tom Clegg over 1 year ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz over 1 year ago

  • Assigned To set to Tom Clegg
Actions #7

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2021-06-23 sprint to 2021-07-07 sprint
Actions #8

Updated by Tom Clegg over 1 year ago

  • Status changed from New to In Progress
testing c1ca6ec2e on 9tee4:
  • add DispatchLSF: InternalURLs: "http://0.0.0.0:9009": true to Services section in /etc/arvados/config.yml
  • add Containers: RuntimeEngine: singularity in /etc/arvados/config.yml
  • systemctl stop crunch-dispatch-slurm
  • install ...path/to/arvados-server /usr/local/bin/arvados-dispatch-lsf
  • arvados-dispatch-lsf
  • submit a container request
{"Listen":"[::]:9009","PID":29877,"Service":"arvados-dispatch-lsf","URL":"http://0.0.0.0:9009/","level":"info","msg":"listening","time":"2021-07-02T17:17:50.230755112Z"}
{"PID":29877,"level":"warning","msg":"FIXME: checkLsfQueueForOrphans","time":"2021-07-02T17:17:50.230759598Z"}
{"PID":29877,"level":"info","msg":"Submitting container 9tee4-dz642-unjdidp7zb08qdm to LSF","time":"2021-07-02T17:17:56.390746354Z"}
{"PID":29877,"level":"info","msg":"bsub command [\"sudo\" \"-E\" \"-u\" \"crunch\" \"bsub\" \"-J\" \"9tee4-dz642-unjdidp7zb08qdm\" \"-R\" \"rusage[mem=757MB:tmp=640MB] affinity[core(1)]\"] script \"#!/bin/sh\\nexec 'crunch-run' '--runtime-engine=singularity' '-cgroup-parent-subsystem=memory' '9tee4-dz642-unjdidp7zb08qdm'\\n\"","time":"2021-07-02T17:17:56.390943355Z"}
{"PID":29877,"level":"info","msg":"bsub finished","stdout":"Job \u003c190\u003e is submitted to default queue \u003cnormal\u003e.\n","time":"2021-07-02T17:17:56.426583960Z"}
{"PID":29877,"level":"info","msg":"Start monitoring container 9tee4-dz642-unjdidp7zb08qdm in state \"Locked\"","time":"2021-07-02T17:17:56.426686121Z"}
{"PID":29877,"level":"info","msg":"container 9tee4-dz642-unjdidp7zb08qdm changed state from Locked to Running","time":"2021-07-02T17:19:27.781201740Z"}
{"PID":29877,"level":"info","msg":"container 9tee4-dz642-unjdidp7zb08qdm is done","time":"2021-07-02T17:19:28.828679722Z"}
{"PID":29877,"level":"info","msg":"Bkill(190)","time":"2021-07-02T17:19:29.389316198Z"}
{"PID":29877,"level":"info","msg":"container 9tee4-dz642-unjdidp7zb08qdm job disappeared from LSF queue","time":"2021-07-02T17:19:30.389644004Z"}
{"PID":29877,"level":"info","msg":"Done monitoring container 9tee4-dz642-unjdidp7zb08qdm","time":"2021-07-02T17:19:33.836542040Z"}
todo/tbd:
  • dispatching to docker doesn't work, even though docker-via-slurm works on the same compute nodes
    • "2021-07-02T15:15:27.602617730Z error in Run: While loading container image: While loading container image into Docker: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.21/images/load?quiet=1": dial unix /var/run/docker.sock: connect: permission denied"
    • my guess is bsub sets up the user's primary group (or the groups that user has on the submitting node), which wouldn't include docker in this case. I tried adding groups with "bsub -G docker" but that group doesn't exist on the submitting node, so: "Bad user group name. Job not submitted."
    • is this a blocker, or should we just document for now that lsf+docker isn't yet supported?
  • implement checkLsfQueueForOrphans (see checkSqueueForOrphans)
  • propagate arvados container priority to lsf job priority (strategy might depend on lsf config)
  • add doc page
  • add deb/rpm package
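
The "bsub finished" log line above shows bsub's stdout as `Job <190> is submitted to default queue <normal>.` (the `\u003c` and `\u003e` in the log are JSON escapes for `<` and `>`), and the later `Bkill(190)` uses the same number. As an illustration only, a job ID could be pulled out of that message as sketched below. `parseJobID` and the regular expression are hypothetical, and the dispatcher may well identify jobs differently, for example by the `-J` job name.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// jobIDPattern matches the numeric ID in bsub's confirmation message.
// The pattern is an assumption based on the single log line in this ticket.
var jobIDPattern = regexp.MustCompile(`Job <(\d+)> is submitted`)

// parseJobID extracts the LSF job ID from bsub's stdout.
func parseJobID(stdout string) (int, error) {
	m := jobIDPattern.FindStringSubmatch(stdout)
	if m == nil {
		return 0, fmt.Errorf("no job ID found in bsub output %q", stdout)
	}
	return strconv.Atoi(m[1])
}

func main() {
	id, err := parseJobID("Job <190> is submitted to default queue <normal>.\n")
	fmt.Println(id, err)
}
```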
Actions #9

Updated by Tom Clegg over 1 year ago

17756-dispatch-lsf @ 8145fbe8e6ab99184fcd41dea042ede63e9ff0d5 -- developer-run-tests: #2566

todo:
  • update docs: add DispatchLSF: InternalURLs: "http://0.0.0.0:9009": true to Services section in /etc/arvados/config.yml
  • priority?
Actions #10

Updated by Tom Clegg over 1 year ago

  • Target version changed from 2021-07-07 sprint to 2021-07-21 sprint
Actions #11

Updated by Nico César over 1 year ago

review @ 3b63632698de9868a501191e8989f14c23e4e743

I think the documentation for BsubArgumentsList should use the array form [] instead of a list. This is already correct in the config.yml documentation. An even better option for sysadmins would be a string, as in "-C 0..", since they are likely to copy and paste from a working integration. Internally it can be transformed to an array.

How does a nonexistent BsubSudoUser on a compute node surface as an error? I'm thinking of a newly added, misconfigured compute node in an already working cluster. It would also be good to have a way to check that a compute node is properly configured (maybe arvados-client diagnostics should have this check?).

func (disp *dispatcher) runContainer() seems slightly convoluted: the first block, with a bunch of nested ifs and elses, makes the error handling unclear to me. I like the x, err := foo ; if (err != nil) { return... } style because it gives a "line of sight": all errors are returned in the same column instead of 7 tabs/spaces deep in the nesting.

In runContainer, in what case will ok be false? Shouldn't we log this?

               case updated, ok := <-status:
                       if !ok {
                               done = true
                               break
                       }

Also I noticed that we check if Priority == 0 {} else {}; can Priority be negative at any point? I remember that at some point we had issues with slurm and some negative priority cases, but that could all be in the past.
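
On the ok question: in Go, the second value of a channel receive is false only once the channel has been closed and drained. The sketch below illustrates that, and also that a bare `break` inside `select` leaves only the `select`, which is why the snippet above needs the `done` flag. The function `drain`, its `stop` parameter, and the log message are hypothetical, and the sketch does not claim to show why the real status channel gets closed.

```go
package main

import "fmt"

// drain reads status updates until the channel is closed or stop fires,
// and returns how many updates it saw.
func drain(status <-chan string, stop <-chan struct{}) int {
	n := 0
	done := false
	for !done {
		select {
		case updated, ok := <-status:
			if !ok {
				// ok == false: the sender closed the channel and
				// no buffered values remain.
				fmt.Println("status channel closed; stopping")
				done = true
				break // exits the select only; the loop ends via done
			}
			fmt.Println("update:", updated)
			n++
		case <-stop:
			done = true
		}
	}
	return n
}

func main() {
	status := make(chan string, 2)
	status <- "Locked"
	status <- "Running"
	close(status)
	// A nil stop channel is never ready, so only status drives the loop.
	fmt.Println(drain(status, nil))
}
```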

Actions #12

Updated by Peter Amstutz over 1 year ago

Actions #13

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2021-07-21 sprint to 2021-08-04 sprint
Actions #14

Updated by Tom Clegg over 1 year ago

17756-dispatch-lsf @ efc1846a758929bdb57b87bdbb3f757f8907c69b -- developer-run-tests: #2602
  • Change example yaml formatting from list to array: BsubArgumentsList: ["-C", "0"]
  • Nested if's and else's addressed by moving the two copies of the "no suitable instance type" reporting code from c-d-slurm/lsf into sdk/go/dispatch.
  • Check Priority<1 instead of ==0 (just in case it happens, although it shouldn't).

I tried "sudo -E -u postgres bsub echo ok" on a test cluster and LSF seemed to just leave that job in PEND state (maybe waiting for a compute node to appear that has that user?). I don't see any complaints about it in the LSF logs. Not sure how hard we should try to help troubleshoot this, but at least I added a note to the config file comment & install doc page.
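
To illustrate why the array form of BsubArgumentsList is convenient, here is a rough sketch of assembling a command line like the one in the earlier "bsub command" log entry. The helper `bsubCommand` and the ordering of the arguments are assumptions made for illustration, not the code on this branch. Each array element becomes exactly one argv entry, so no shell-style word splitting is needed.

```go
package main

import "fmt"

// bsubCommand is a hypothetical helper that builds the argv for a
// submission. sudoUser stands in for BsubSudoUser and extraArgs for
// BsubArgumentsList; an empty sudoUser means "run bsub directly".
func bsubCommand(sudoUser string, extraArgs []string, jobName string) []string {
	var cmd []string
	if sudoUser != "" {
		cmd = append(cmd, "sudo", "-E", "-u", sudoUser)
	}
	cmd = append(cmd, "bsub", "-J", jobName)
	// Append each configured argument as its own argv element.
	cmd = append(cmd, extraArgs...)
	return cmd
}

func main() {
	args := bsubCommand("crunch", []string{"-C", "0"}, "9tee4-dz642-unjdidp7zb08qdm")
	fmt.Printf("%q\n", args)
}
```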

Actions #15

Updated by Nico César over 1 year ago

review @ efc1846a758929bdb57b87bdbb3f757f8907c69b

LGTM ready to merge

Actions #16

Updated by Tom Clegg over 1 year ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #17

Updated by Peter Amstutz about 1 year ago

  • Release set to 42