Feature #17756

closed

Initial implementation of LSF dispatcher

Added by Peter Amstutz almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: -
Release relationship: Auto

Description

Similar to crunch-dispatch-slurm. Should be packaged as a subcommand of arvados-server, though.

Use Arvados SDK "Dispatcher" class to interact with API server.

Start with basic dispatch to queue (bsub) and monitoring.

Stub out commands to support unit testing (as done with slurm).

Test on 9tee4


Subtasks 1 (0 open, 1 closed)

Task #17795: Review 17756-dispatch-lsf (Resolved, Tom Clegg, 07/14/2021)

Related issues

Related to Arvados Epics - Idea #16304: LSF support (Resolved, 04/01/2021 - 09/30/2021)
Actions #1

Updated by Peter Amstutz almost 3 years ago

  • Category set to Crunch
Actions #2

Updated by Tom Clegg almost 3 years ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz almost 3 years ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz almost 3 years ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz almost 3 years ago

  • Assigned To set to Tom Clegg
Actions #7

Updated by Peter Amstutz almost 3 years ago

  • Target version changed from 2021-06-23 sprint to 2021-07-07 sprint
Actions #8

Updated by Tom Clegg almost 3 years ago

  • Status changed from New to In Progress
testing c1ca6ec2e on 9tee4:
  • add DispatchLSF: InternalURLs: "http://0.0.0.0:9009": true to Services section in /etc/arvados/config.yml
  • add Containers: RuntimeEngine: singularity in /etc/arvados/config.yml
  • systemctl stop crunch-dispatch-slurm
  • install ...path/to/arvados-server /usr/local/bin/arvados-dispatch-lsf
  • arvados-dispatch-lsf
  • submit a container request
{"Listen":"[::]:9009","PID":29877,"Service":"arvados-dispatch-lsf","URL":"http://0.0.0.0:9009/","level":"info","msg":"listening","time":"2021-07-02T17:17:50.230755112Z"}
{"PID":29877,"level":"warning","msg":"FIXME: checkLsfQueueForOrphans","time":"2021-07-02T17:17:50.230759598Z"}
{"PID":29877,"level":"info","msg":"Submitting container 9tee4-dz642-unjdidp7zb08qdm to LSF","time":"2021-07-02T17:17:56.390746354Z"}
{"PID":29877,"level":"info","msg":"bsub command [\"sudo\" \"-E\" \"-u\" \"crunch\" \"bsub\" \"-J\" \"9tee4-dz642-unjdidp7zb08qdm\" \"-R\" \"rusage[mem=757MB:tmp=640MB] affinity[core(1)]\"] script \"#!/bin/sh\\nexec 'crunch-run' '--runtime-engine=singularity' '-cgroup-parent-subsystem=memory' '9tee4-dz642-unjdidp7zb08qdm'\\n\"","time":"2021-07-02T17:17:56.390943355Z"}
{"PID":29877,"level":"info","msg":"bsub finished","stdout":"Job \u003c190\u003e is submitted to default queue \u003cnormal\u003e.\n","time":"2021-07-02T17:17:56.426583960Z"}
{"PID":29877,"level":"info","msg":"Start monitoring container 9tee4-dz642-unjdidp7zb08qdm in state \"Locked\"","time":"2021-07-02T17:17:56.426686121Z"}
{"PID":29877,"level":"info","msg":"container 9tee4-dz642-unjdidp7zb08qdm changed state from Locked to Running","time":"2021-07-02T17:19:27.781201740Z"}
{"PID":29877,"level":"info","msg":"container 9tee4-dz642-unjdidp7zb08qdm is done","time":"2021-07-02T17:19:28.828679722Z"}
{"PID":29877,"level":"info","msg":"Bkill(190)","time":"2021-07-02T17:19:29.389316198Z"}
{"PID":29877,"level":"info","msg":"container 9tee4-dz642-unjdidp7zb08qdm job disappeared from LSF queue","time":"2021-07-02T17:19:30.389644004Z"}
{"PID":29877,"level":"info","msg":"Done monitoring container 9tee4-dz642-unjdidp7zb08qdm","time":"2021-07-02T17:19:33.836542040Z"}
todo/tbd:
  • dispatching to docker doesn't work, even though docker-via-slurm works on the same compute nodes
    • "2021-07-02T15:15:27.602617730Z error in Run: While loading container image: While loading container image into Docker: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.21/images/load?quiet=1": dial unix /var/run/docker.sock: connect: permission denied"
    • my guess is bsub sets up the user's primary group (or the groups that user has on the submitting node), which wouldn't include docker in this case. I tried adding groups with "bsub -G docker" but that group doesn't exist on the submitting node, so: "Bad user group name. Job not submitted."
    • is this a blocker, or should we just document for now that lsf+docker isn't yet supported?
  • implement checkLsfQueueForOrphans (see checkSqueueForOrphans)
  • propagate arvados container priority to lsf job priority (strategy might depend on lsf config)
  • add doc page
  • add deb/rpm package
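
For reference, the bsub command line logged above can be put together roughly like this. This is only a sketch, not the code in the branch: bsubArgs and its parameters (sudoUser, memMB, tmpMB, cores) are hypothetical names, and the resource string simply mirrors the format seen in the log.

```go
package main

import "fmt"

// bsubArgs is a hypothetical helper that builds an argv shaped like the one
// in the "bsub command" log entry. If sudoUser is empty, bsub is invoked
// directly without the sudo prefix.
func bsubArgs(sudoUser, uuid string, memMB, tmpMB int64, cores int) []string {
	var args []string
	if sudoUser != "" {
		args = append(args, "sudo", "-E", "-u", sudoUser)
	}
	// -J names the LSF job after the container UUID; -R carries the
	// resource requirements (memory, scratch space, CPU cores).
	resources := fmt.Sprintf("rusage[mem=%dMB:tmp=%dMB] affinity[core(%d)]", memMB, tmpMB, cores)
	args = append(args, "bsub", "-J", uuid, "-R", resources)
	return args
}

func main() {
	fmt.Printf("%q\n", bsubArgs("crunch", "9tee4-dz642-unjdidp7zb08qdm", 757, 640, 1))
}
```

Naming the job after the container UUID is what lets the dispatcher match LSF queue entries back to containers later on.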
Actions #9

Updated by Tom Clegg almost 3 years ago

17756-dispatch-lsf @ 8145fbe8e6ab99184fcd41dea042ede63e9ff0d5 -- developer-run-tests: #2566

todo:
  • update docs: add DispatchLSF: InternalURLs: "http://0.0.0.0:9009": true to Services section in /etc/arvados/config.yml
  • priority?
Actions #10

Updated by Tom Clegg almost 3 years ago

  • Target version changed from 2021-07-07 sprint to 2021-07-21 sprint
Actions #11

Updated by Nico César almost 3 years ago

review @ 3b63632698de9868a501191e8989f14c23e4e743

I think the documentation for BsubArgumentsList should use the array form ([]) instead of a list. This is already correct in the config.yml documentation. An even better option for sysadmins would be a string, as in "-C 0..", since they are likely to copy and paste from working integrations. Internally it can be transformed into an array.
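
To illustrate the string-to-array idea, here is a minimal sketch. splitBsubArgs is a hypothetical helper, not something in the branch, and it assumes plain whitespace splitting is acceptable.

```go
package main

import (
	"fmt"
	"strings"
)

// splitBsubArgs is a hypothetical helper that turns a single configured
// string into an argument array by splitting on whitespace.
// Caveat: it does not understand shell quoting, so an argument that
// contains spaces would be split into several pieces.
func splitBsubArgs(s string) []string {
	return strings.Fields(s)
}

func main() {
	fmt.Printf("%q\n", splitBsubArgs("-C 0"))
}
```

The quoting caveat is the main trade-off: a string is easier to paste, but an array is unambiguous when an argument contains spaces.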

How does a nonexistent BsubSudoUser on a compute node surface as an error? I'm thinking of a newly added, misconfigured compute node in an already working cluster. It would also be good to have a way to check that a compute node is properly configured (maybe arvados-client diagnostics should have this check?)

func (disp *dispatcher) runContainer() seems slightly convoluted: the first block has a bunch of nested ifs and elses, and its error handling is unclear to me. I like the x, err := foo; if err != nil { return ... } style because it gives a "line of sight": all returned errors sit in the same column instead of 7 tabs/spaces deep in the nesting.
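
A small sketch of the "line of sight" style I mean. The functions lookup, submit and run are hypothetical stand-ins, not the actual runContainer code.

```go
package main

import (
	"errors"
	"fmt"
)

// lookup is a hypothetical step that can fail.
func lookup(uuid string) (string, error) {
	if uuid == "" {
		return "", errors.New("empty UUID")
	}
	return "job-" + uuid, nil
}

// submit is a second hypothetical step that can fail.
func submit(job string) error {
	if job == "job-bad" {
		return errors.New("submit rejected")
	}
	return nil
}

// run handles each error as soon as it occurs and returns early, so the
// successful path stays at the left margin instead of nesting deeper.
func run(uuid string) error {
	job, err := lookup(uuid)
	if err != nil {
		return fmt.Errorf("lookup: %w", err)
	}
	if err := submit(job); err != nil {
		return fmt.Errorf("submit: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(run("ok"))
	fmt.Println(run(""))
	fmt.Println(run("bad"))
}
```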

In runContainer, in what case will ok be false? Shouldn't we log this?

               case updated, ok := <-status:
                       if !ok {
                               done = true
                               break
                       }
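
For context, in Go the second value of a channel receive is false only once the channel has been closed and drained. Below is a minimal sketch of the same loop shape with a log line added. drain and the string-typed status channel are hypothetical, not the dispatcher's real types.

```go
package main

import "fmt"

// drain reads updates until the status channel is closed and returns how
// many updates it received.
func drain(status <-chan string) int {
	n := 0
	done := false
	for !done {
		select {
		case updated, ok := <-status:
			if !ok {
				// ok is false only after the channel is closed and empty.
				fmt.Println("status channel closed")
				done = true
				break // exits the select only; the loop ends because done is set
			}
			n++
			fmt.Println("update:", updated)
		}
	}
	return n
}

func main() {
	status := make(chan string, 2)
	status <- "Locked"
	status <- "Running"
	close(status)
	fmt.Println(drain(status))
}
```

Note that break inside a select leaves only the select, which is why the done flag is needed to end the surrounding for loop.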

Also, I noticed that we check if Priority == 0 {} else {}; can Priority be negative at any point? I remember that we once had issues with slurm and some negative priority cases, but that could all be a thing of the past.

Actions #12

Updated by Peter Amstutz almost 3 years ago

Actions #13

Updated by Peter Amstutz almost 3 years ago

  • Target version changed from 2021-07-21 sprint to 2021-08-04 sprint
Actions #14

Updated by Tom Clegg almost 3 years ago

17756-dispatch-lsf @ efc1846a758929bdb57b87bdbb3f757f8907c69b -- developer-run-tests: #2602
  • Change example yaml formatting from list to array: BsubArgumentsList: ["-C", "0"]
  • Nested if's and else's addressed by moving the two copies of the "no suitable instance type" reporting code from c-d-slurm/lsf into sdk/go/dispatch.
  • Check Priority<1 instead of ==0 (just in case it happens, although it shouldn't).

I tried "sudo -E -u postgres bsub echo ok" on a test cluster and LSF seemed to just leave that job in PEND state (maybe waiting for a compute node to appear that has that user?). I don't see any complaints about it in the LSF logs. Not sure how hard we should try to help troubleshoot this, but at least I added a note to the config file comment & install doc page.

Actions #15

Updated by Nico César almost 3 years ago

review @ efc1846a758929bdb57b87bdbb3f757f8907c69b

LGTM ready to merge

Actions #16

Updated by Tom Clegg over 2 years ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #17

Updated by Peter Amstutz over 2 years ago

  • Release set to 42